arXiv:1509.04093v2 [math.ST] 27 Jun 2016


Sharp Oracle inequalities for square root regularization

Benjamin Stucky and Sara van de Geer 

June 28, 2016 


Seminar für Statistik
ETH Zurich 
Switzerland 

Abstract 

We study a set of regularization methods for high-dimensional linear regression models. These penalized estimators have the square root of the residual sum of squared errors as loss function, and any weakly decomposable norm as penalty function. This fit measure is chosen because of its property that the estimator does not depend on the unknown standard deviation of the noise. On the other hand, a generalized weakly decomposable norm penalty is very useful in being able to deal with different underlying sparsity structures. We can choose a different sparsity inducing norm depending on how we want to interpret the unknown parameter vector $\beta$. Structured sparsity norms, as defined in Micchelli et al. [18], are special cases of weakly decomposable norms, therefore we also include the square root LASSO (Belloni et al. [3]), the group square root LASSO (Bunea et al. [10]) and a new method called the square root SLOPE (in a similar fashion to the SLOPE from Bogdan et al. [6]). For this collection of estimators our results provide sharp oracle inequalities with the Karush-Kuhn-Tucker conditions. We discuss some examples of estimators. Based on a simulation we illustrate some advantages of the square root SLOPE.

Keywords: Square Root LASSO, Structured Sparsity, Karush-Kuhn-Tucker, Sharp Oracle Inequality, Weak Decomposability.

1 Introduction and Model 

Recent technological developments have made data gathering far less of a problem. In some sense there is more data than we can handle, or than we need. The problem has shifted towards finding useful and meaningful information in the big sea of data. An example where such problems arise is the high-dimensional linear regression model

$$Y = X\beta^0 + \epsilon. \tag{1.1}$$

Here $Y$ is the $n$-dimensional response variable, $X$ is the $n \times p$ design matrix and $\epsilon$ is the independent and identically distributed noise vector. The noise has $E(\epsilon_i) = 0$, $\mathrm{Var}(\epsilon_i) = \sigma^2$ $\forall i \in \{1,\dots,n\}$. Assume that $\sigma$ is unknown, and that $\beta^0$ is the "true" underlying $p$-dimensional parameter vector of the linear regression model with active set $S_0 := \mathrm{supp}(\beta^0)$.

When trying to explain $Y$ through other variables in the high-dimensional linear regression model, we need to set less important explanatory variables to zero. Otherwise we would overfit. This is the process of finding a trade-off between a good fit and a sparse solution. In other words, we are trying to find a solution that explains our data well, but at the same time only uses the more important variables to do so.

The most famous and widely used estimator for the high-dimensional regression model is the $\ell_1$-regularized version of least squares, called LASSO (Tibshirani [24])

$$\hat\beta_L(\sigma) := \arg\min_{\beta \in \mathbb{R}^p} \left\{ \|Y - X\beta\|_n^2 + 2\lambda_1\sigma\|\beta\|_1 \right\}.$$

Here $\lambda_1$ is a constant called the regularization level, which regulates how sparse our solution should be. Also note that the construction of the LASSO estimator depends on the unknown noise level $\sigma$. We moreover let $\|a\|_1 := \sum_{i=1}^p |a_i|$ for $a \in \mathbb{R}^p$ denote the $\ell_1$-norm, and for any $a \in \mathbb{R}^n$ we write $\|a\|_n^2 := \sum_{j=1}^n a_j^2/n$, the $\ell_2$-norm squared and divided by $n$. The LASSO uses the $\ell_1$-norm as a measure of sparsity. Used as a regularizer, this measure sets a number of parameters to zero.

Let us rewrite the LASSO in the following form

$$\hat\beta_L = \arg\min_{\beta \in \mathbb{R}^p} \left( \|Y - X\beta\|_n + \lambda^\sigma(\beta)\|\beta\|_1 \right) \cdot \|Y - X\beta\|_n,$$

where $\lambda^\sigma(\beta) := 2\lambda_1\sigma / \|Y - X\beta\|_n$. Instead of minimizing with $\lambda^\sigma(\beta)$, a function of $\beta$, let us assume that we keep $\lambda^\sigma(\beta)$ a fixed constant. Then we get the Square Root LASSO method

$$\hat\beta_{srL} := \arg\min_{\beta \in \mathbb{R}^p} \left\{ \|Y - X\beta\|_n + \lambda\|\beta\|_1 \right\}.$$

So in some sense the $\lambda$ for the Square Root LASSO is a scaled version, scaled by an adaptive estimator of $\sigma$, of $\lambda_1$ from the LASSO. By the optimality conditions it is true that

$$\hat\beta_L(\|Y - X\hat\beta_{srL}\|_n) = \hat\beta_{srL}.$$

The Square Root LASSO was introduced by Belloni et al. [3] in order to get a pivotal method. An equivalent formulation as a joint convex optimization program can be found in Owen [20]. This method has been studied under the name Scaled LASSO in Sun and Zhang [23]. Pivotal means that the theoretical $\lambda$ does not depend on the unknown standard deviation $\sigma$ or on any estimated version of it. The estimator does not require the estimation of the unknown $\sigma$. Belloni et al. [4] also showed that under Gaussian noise the theoretical $\lambda$ can be chosen of order $\Phi^{-1}(1 - \alpha/2p)/\sqrt{n-1}$, with $\Phi^{-1}$ denoting the inverse of the standard Gaussian cumulative distribution function, and $\alpha$ being some small probability. This is independent of $\sigma$ and achieves a near oracle inequality for the prediction norm of convergence rate $\sigma\sqrt{(|S_0|/n)\log(p/\alpha)}$. In contrast to that, the theoretical penalty level of the LASSO depends on knowing $\sigma$ in order to achieve similar oracle inequalities for the prediction norm.
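The penalty level quoted above can be evaluated directly. The following is a minimal sketch, assuming the formula reads $\Phi^{-1}(1 - \alpha/(2p))/\sqrt{n-1}$; the function name `sqrt_lasso_lambda` and the example values of `n`, `p` and `alpha` are illustrative choices, not taken from the paper.

```python
import math
from statistics import NormalDist


def sqrt_lasso_lambda(n, p, alpha):
    """Penalty level Phi^{-1}(1 - alpha / (2p)) / sqrt(n - 1).

    Assumption: this is the reading of the order quoted from Belloni et al.;
    note that no noise level sigma enters the formula.
    """
    quantile = NormalDist().inv_cdf(1.0 - alpha / (2.0 * p))
    return quantile / math.sqrt(n - 1)


# Illustrative values (hypothetical): n = 100 observations, p = 500 predictors.
lam = sqrt_lasso_lambda(n=100, p=500, alpha=0.05)
lam_more_predictors = sqrt_lasso_lambda(n=100, p=5000, alpha=0.05)
lam_more_samples = sqrt_lasso_lambda(n=400, p=500, alpha=0.05)
```

The level grows slowly (through the Gaussian quantile) with the number of predictors and shrinks with the sample size, and it can be computed before seeing any estimate of $\sigma$.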

The idea of the square root LASSO was further developed in Bunea et al. 
[10] to the group square root LASSO, in order to get a selection of groups of 
predictors. The group LASSO norm is another way to describe an underlying 
sparsity, namely if groups of parameters should be set to zero, instead of 
individual parameters. Another extension of the square root LASSO to
the case of matrix completion was given by Klopp [12].

In this paper we go further and generalize the idea of the square root LASSO to any sparsity inducing norm. From now on we will look at the family of norm penalty regularization methods, which are of the following square root type

$$\hat\beta := \arg\min_{\beta \in \mathbb{R}^p} \left\{ \|Y - X\beta\|_n + \lambda\Omega(\beta) \right\},$$

where $\Omega$ is any norm on $\mathbb{R}^p$. This set of regularization methods will be called square root regularization methods. Furthermore, we introduce the following notations

$\hat\epsilon := Y - X\hat\beta$, the residuals,

$\Omega^*(x) := \max_{z,\ \Omega(z) \le 1} z^T x$, $x \in \mathbb{R}^p$, the dual norm of the norm $\Omega$, and

$\beta_S := \{\beta_j : j \in S\}$ $\forall S \subset \{1,\dots,p\}$ and all vectors $\beta \in \mathbb{R}^p$.
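To make the dual-norm notation concrete, here is a small sketch for the special case $\Omega = \|\cdot\|_1$, whose dual norm is the $\ell_\infty$-norm. It checks the inequality $z^T x \le \Omega(z)\,\Omega^*(x)$ on random vectors and exhibits a maximizer. The helper names and the zero-padding convention used in `restrict` for $\beta_S$ are assumptions made for illustration.

```python
import random


def l1_norm(v):
    return sum(abs(t) for t in v)


def linf_norm(v):
    return max(abs(t) for t in v)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def restrict(beta, S):
    """beta_S, stored here as a zero-padded vector of the same length (a convention)."""
    return [b if j in S else 0.0 for j, b in enumerate(beta)]


# z^T x <= Omega(z) * Omega^*(x) for Omega = l1 and Omega^* = l_inf.
random.seed(1)
worst_gap = float("inf")
for _ in range(1000):
    z = [random.uniform(-1, 1) for _ in range(6)]
    x = [random.uniform(-1, 1) for _ in range(6)]
    worst_gap = min(worst_gap, l1_norm(z) * linf_norm(x) - dot(z, x))

# The maximum in the definition of the dual norm is attained at a signed unit
# vector placed on the largest coordinate of x.
x = [0.5, -2.0, 1.0]
k = max(range(len(x)), key=lambda j: abs(x[j]))
z_star = [0.0] * len(x)
z_star[k] = 1.0 if x[k] > 0 else -1.0
attained = dot(z_star, x)
```

Here `attained` equals `linf_norm(x)`, which is what the maximization over the unit $\ell_1$-ball in the definition of $\Omega^*$ returns.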


Later we will see that describing the underlying sparsity with an appropriate sparsity norm can make a difference in the size of the errors. Therefore in this paper we extend the idea of the square root LASSO with the $\ell_1$-penalty to more general weakly decomposable norm penalties. The theoretical $\lambda$ of such an estimator will not depend on $\sigma$ either. We introduce the Karush-Kuhn-Tucker conditions for these estimators and give sharp oracle inequalities. In the last two sections we will give some examples of different norms and simulations comparing the square root LASSO with the square root SLOPE.
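One way to see why the penalty level of a square root type estimator need not know the noise scale is that its objective is positively homogeneous: replacing $(Y, \beta)$ by $(cY, c\beta)$ multiplies the objective by $c$, so the minimizer for $cY$ is $c$ times the minimizer for $Y$ with the same $\lambda$. The sketch below checks this numerically for the $\ell_1$ penalty and contrasts it with the LASSO objective at a fixed $\lambda_1\sigma$. All function names and the random test data are illustrative assumptions.

```python
import math
import random


def norm_n(v):
    """Empirical norm ||v||_n = sqrt(sum(v_i^2) / n)."""
    return math.sqrt(sum(t * t for t in v) / len(v))


def residual(Y, X, beta):
    """Residual Y - X beta for a design matrix X stored as a list of rows."""
    return [y - sum(xij * b for xij, b in zip(row, beta)) for y, row in zip(Y, X)]


def sqrt_lasso_objective(Y, X, beta, lam):
    return norm_n(residual(Y, X, beta)) + lam * sum(abs(b) for b in beta)


def lasso_objective(Y, X, beta, lam1, sigma):
    return norm_n(residual(Y, X, beta)) ** 2 + 2 * lam1 * sigma * sum(abs(b) for b in beta)


random.seed(0)
n, p, c, lam = 20, 3, 7.0, 0.3
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
Y = [random.gauss(0, 1) for _ in range(n)]
beta = [random.gauss(0, 1) for _ in range(p)]

scaled_Y = [c * y for y in Y]
scaled_beta = [c * b for b in beta]

# Square root objective: rescaling (Y, beta) by c rescales the objective by exactly c.
sqrt_ratio = (sqrt_lasso_objective(scaled_Y, X, scaled_beta, lam)
              / sqrt_lasso_objective(Y, X, beta, lam))

# LASSO objective with the penalty held fixed: the two terms scale differently
# (c^2 versus c), so the ratio is neither c nor c^2.
lasso_ratio = (lasso_objective(scaled_Y, X, scaled_beta, lam, 1.0)
               / lasso_objective(Y, X, beta, lam, 1.0))
```

The same homogeneity argument applies to any norm penalty $\Omega$, since $\Omega(c\beta) = c\,\Omega(\beta)$ for $c > 0$.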


2 Karush-Kuhn-Tucker Conditions 


As we have already seen, these estimators are defined as minimizers over $\beta$. The Karush-Kuhn-Tucker conditions characterize this minimum. In order to formulate these optimality conditions we need some concepts of convex optimization. For the reader who is not familiar with this topic, we will introduce the subdifferential, which generalizes the differential, and give a short overview of some properties, as can be found for example in Bach et al. [1]. For any convex function $g : \mathbb{R}^p \to \mathbb{R}$ and any vector $w \in \mathbb{R}^p$ we define its subdifferential as

$$\partial g(w) := \{ z \in \mathbb{R}^p ;\ g(w') \ge g(w) + z^T(w' - w)\ \ \forall w' \in \mathbb{R}^p \}.$$

The elements of $\partial g(w)$ are called the subgradients of $g$ at $w$.

Let us remark that all convex functions have non-empty subdifferentials at every point. Moreover, by the definition of the subdifferential, any subgradient $z$ defines an affine function $w' \mapsto g(w) + z^T(w' - w)$ that takes the value $g(w)$ at $w$ and lies below the function $g$ at every point. If $g$ is differentiable at $w$, then its subdifferential at $w$ contains only the usual gradient. Now the next lemma, which dates back to Pierre Fermat (see Bauschke and Combettes [2]), shows how to find a global minimum for a convex function $g$.

Lemma 1 (Fermat's Rule). For all convex functions $g : \mathbb{R}^p \to \mathbb{R}$ it holds that

$$v \in \mathbb{R}^p \text{ is a global minimum of } g \iff 0 \in \partial g(v).$$

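For any norm $\Omega$ on $\mathbb{R}^p$ with $\omega \in \mathbb{R}^p$ it holds true that its subdifferential can be written as (see Bach et al. [1] Proposition 1.2)

$$\partial\Omega(\omega) = \begin{cases} \{ z \in \mathbb{R}^p ;\ \Omega^*(z) \le 1 \} & \text{if } \omega = 0, \\ \{ z \in \mathbb{R}^p ;\ \Omega^*(z) = 1 \wedge z^T\omega = \Omega(\omega) \} & \text{if } \omega \ne 0. \end{cases} \tag{2.1}$$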
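As a sanity check of (2.1) in the $\ell_1$ case, the following sketch takes a nonzero vector $w$, uses the coordinatewise sign vector as a candidate subgradient, and verifies the two defining conditions $\Omega^*(z) = 1$ and $z^T w = \Omega(w)$ together with the subgradient inequality on random points. The specific vector and helper names are illustrative assumptions.

```python
import random


def l1_norm(v):
    return sum(abs(t) for t in v)


def linf_norm(v):
    return max(abs(t) for t in v)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sign(t):
    return (t > 0) - (t < 0)


w = [1.5, -0.2, 0.0, 3.0]
# One subgradient of the l1-norm at w; on zero coordinates any value in [-1, 1] would do.
z = [float(sign(t)) for t in w]

dual_norm_is_one = linf_norm(z) == 1.0
inner_product_matches = abs(dot(z, w) - l1_norm(w)) < 1e-12

# Subgradient inequality: ||w'||_1 >= ||w||_1 + z^T (w' - w) for all w'.
random.seed(2)
violations = 0
for _ in range(1000):
    w_prime = [random.uniform(-5, 5) for _ in w]
    lower_bound = l1_norm(w) + dot(z, [a - b for a, b in zip(w_prime, w)])
    if l1_norm(w_prime) < lower_bound - 1e-12:
        violations += 1
```

No violations occur, because $z^T w' \le \|z\|_\infty \|w'\|_1 = \|w'\|_1$ for every $w'$.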


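We are able to apply these properties to our estimator $\hat\beta$. Lemma 1 implies that

$$\hat\beta \text{ is optimal} \iff -\frac{1}{\lambda}\nabla\|Y - X\hat\beta\|_n \in \partial\Omega(\hat\beta).$$

This means that, in the case $\|\hat\epsilon\|_n > 0$, for the square root regularization estimator $\hat\beta$ it holds true that

$$\hat\beta \text{ is optimal} \iff \frac{X^T(Y - X\hat\beta)}{n\lambda\|Y - X\hat\beta\|_n} \in \partial\Omega(\hat\beta). \tag{2.2}$$

By combining equation (2.1) with (2.2) we can write the KKT conditions as

$$\hat\beta \text{ is optimal} \iff \begin{cases} \Omega^*\!\left(\dfrac{\hat\epsilon^T X}{n\|\hat\epsilon\|_n}\right) \le \lambda & \text{if } \hat\beta = 0, \\[2ex] \Omega^*\!\left(\dfrac{\hat\epsilon^T X}{n\|\hat\epsilon\|_n}\right) = \lambda \ \wedge\ \dfrac{\hat\epsilon^T X\hat\beta}{n\|\hat\epsilon\|_n} = \lambda\Omega(\hat\beta) & \text{if } \hat\beta \ne 0. \end{cases} \tag{2.3}$$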
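The KKT conditions can be checked numerically in the simplest setting of a single predictor ($p = 1$) with the $\ell_1$ penalty, where the dual norm is the absolute value. The sketch below is one possible way to compute the estimator in that case, by bisection on the derivative of the convex objective; the solver `sqrt_lasso_1d`, the simulated data and the values of `lam` are assumptions made for illustration and are not the authors' algorithm.

```python
import math
import random


def norm_n(v):
    return math.sqrt(sum(t * t for t in v) / len(v))


def sqrt_lasso_1d(x, y, lam, iters=200):
    """Minimise ||y - x*b||_n + lam*|b| over a scalar b (illustrative solver).

    Assumes y is not an exact multiple of x, so the residual norm stays positive.
    """
    n = len(y)
    xy = sum(a * b for a, b in zip(x, y))
    # KKT condition for b = 0: |x^T y| / (n ||y||_n) <= lam.
    if abs(xy) / (n * norm_n(y)) <= lam:
        return 0.0
    s = 1.0 if xy > 0 else -1.0

    def slope(t):
        # Derivative of the objective along b = s*t, t > 0.
        r = [yi - xi * s * t for xi, yi in zip(x, y)]
        return -s * sum(xi * ri for xi, ri in zip(x, r)) / (n * norm_n(r)) + lam

    # The slope is negative at 0 and equals lam > 0 at the least squares solution.
    lo, hi = 0.0, abs(xy) / sum(a * a for a in x)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if slope(mid) < 0:
            lo = mid
        else:
            hi = mid
    return s * 0.5 * (lo + hi)


random.seed(3)
n = 50
x = [random.gauss(0, 1) for _ in range(n)]
y = [2.0 * xi + 0.5 * random.gauss(0, 1) for xi in x]

lam = 0.1
b_hat = sqrt_lasso_1d(x, y, lam)
res = [yi - xi * b_hat for xi, yi in zip(x, y)]
# Left-hand side of the KKT condition: eps_hat^T x / (n ||eps_hat||_n).
kkt_stat = sum(xi * ri for xi, ri in zip(x, res)) / (n * norm_n(res))

# A large penalty level forces the zero solution.
b_zero = sqrt_lasso_1d(x, y, 2.0)
```

For the nonzero solution, `abs(kkt_stat)` matches `lam`, and `kkt_stat * b_hat` equals `lam * abs(b_hat)`, which are the two parts of the second case in (2.3).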


What we might first remark about equation (2.3) is that in the case of $\hat\beta \ne 0$ the second part can be written as

$$\hat\epsilon^T X\hat\beta/n = \Omega(\hat\beta)\,\Omega^*\!\left(\hat\epsilon^T X/n\right).$$

This means that we in fact have equality in the generalized Cauchy-Schwarz inequality for these two $p$-dimensional vectors. Furthermore let us remark that the equality

$$\hat\epsilon^T X\hat\beta/n = \Omega(\hat\beta)\lambda\|\hat\epsilon\|_n$$

trivially holds true for the case where $\hat\beta = 0$. It is important to remark here that, in contrast to the KKT conditions for the LASSO, we have an additional $\|\hat\epsilon\|_n$ term in the expression $\Omega^*\!\left(\frac{\hat\epsilon^T X}{n\|\hat\epsilon\|_n}\right)$. This nice scaling leads to the property that the theoretical $\lambda$ is independent of $\sigma$.

With the KKT conditions we are able to formulate a generalized type of KKT conditions. This next lemma is needed for the proofs in the next section.

Lemma 2. For the square root type estimator $\hat\beta$ we have for any $\beta \in \mathbb{R}^p$ and when $\|\hat\epsilon\|_n \ne 0$

$$\frac{1}{\|\hat\epsilon\|_n}\hat\epsilon^T X(\beta - \hat\beta)/n + \lambda\Omega(\hat\beta) \le \lambda\Omega(\beta).$$

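Proof. First we need to look at the inequality from the KKT conditions, which holds in any case

$$\Omega^*\!\left(\frac{\hat\epsilon^T X}{n\|\hat\epsilon\|_n}\right) \le \lambda. \tag{2.4}$$

By the definition of the dual norm and the maximum, we have with (2.4)

$$\frac{1}{\|\hat\epsilon\|_n}\hat\epsilon^T X\beta/n \le \Omega(\beta) \cdot \max_{\tilde\beta \in \mathbb{R}^p,\ \Omega(\tilde\beta) \le 1} \frac{\hat\epsilon^T X\tilde\beta}{n\|\hat\epsilon\|_n} = \Omega(\beta)\,\Omega^*\!\left(\frac{\hat\epsilon^T X}{n\|\hat\epsilon\|_n}\right) \le \Omega(\beta)\lambda. \tag{2.5}$$

The second equation from the KKT conditions, which again holds in any case, is

$$\frac{1}{\|\hat\epsilon\|_n}\hat\epsilon^T X\hat\beta/n = \lambda\Omega(\hat\beta). \tag{2.6}$$

Now putting (2.5) and (2.6) together we get the result. □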


3 Sharp Oracle Inequalities for the square root 
regularization estimators 

We provide sharp oracle inequalities for the estimator $\hat\beta$ with a norm $\Omega$ that satisfies a so-called weak decomposability condition. An oracle inequality is a bound on the estimation and prediction errors. It shows how well these estimators estimate the parameter vector $\beta^0$. This is an extension of the sharp oracle results given in van de Geer [26] for LASSO type estimators, which in turn was a generalization of the sharp oracle inequalities for the LASSO and nuclear norm penalization in Koltchinskii [13] and Koltchinskii et al. [14]. Let us first introduce all the necessary definitions and concepts. Some normalized quantities need to be introduced:

$$f := \frac{\lambda\Omega(\beta^0)}{\|\epsilon\|_n},$$

$$\lambda^S := \frac{\Omega^*\!\left((\epsilon^T X)_S\right)}{n\|\epsilon\|_n},$$

$$\lambda^{S^c} := \frac{\Omega^{S^c *}\!\left((\epsilon^T X)_{S^c}\right)}{n\|\epsilon\|_n},$$

$$\lambda^m := \max(\lambda^S, \lambda^{S^c}),$$

$$\lambda^0 := \frac{\Omega^*(\epsilon^T X)}{n\|\epsilon\|_n}.$$

For example, the quantity $f$ gives the measure of the true underlying normalized sparsity. $\Omega^{S^c}$ denotes a norm on $\mathbb{R}^{p-|S|}$ which will shortly be defined in Assumption II. Furthermore $\lambda^m$ will take the role of the theoretical (unknown) $\lambda$. If we compare this to the case of the LASSO we see that instead of the $\ell_\infty$-norm we generalized it to the dual norm of $\Omega$. Also remark that in $\lambda^m$ a term $1/\|\epsilon\|_n$ appears. This scaling is due to the square root regularization, and it is the reason that $\lambda$ can be chosen independently of the unknown standard deviation $\sigma$. Now we will give the two main assumptions that need to hold in order to prove the oracle inequalities. Assumption I deals with avoiding overfitting, and the main concern of Assumption II is that the norm has the desired property of promoting a structured sparse solution $\hat\beta$. We will later see that the structured sparsity norms in Micchelli et al. [18] and Micchelli et al. [19] are all of this form. Thus, Assumption II is quite general.

Assumption I (overfitting):

If $\|\hat\epsilon\|_n = 0$, then $\hat\beta$ does the same thing as the Ordinary Least Squares (OLS) estimator $\hat\beta_{OLS}$, namely it overfits. That is why we need a lower bound on $\|\hat\epsilon\|_n$. In order to achieve this lower bound we make the following assumptions:

$$P\left( Y \in \left\{ \tilde Y : \min_{\beta \text{ s.t. } X\beta = \tilde Y} \lambda\Omega(\beta) \le \|\epsilon\|_n \right\} \right) = 0,$$

$$\frac{\lambda^0}{\lambda}(1 + 2f) < 1.$$

The $\lambda^0/\lambda$ term makes sure that we introduce enough sparsity (no overfitting).

Assumption II (weak decomposability):

Assumption II is fulfilled for a set $S \subset \{1,\dots,p\}$ and a norm $\Omega$ on $\mathbb{R}^p$ if this norm is weakly decomposable, and $S$ is an allowed set for this norm. This was used by van de Geer [26] and goes back to Bach et al. [1]. It is an assumption on the structure of the sparsity inducing norm. By the triangle inequality we have:

$$\Omega(\beta_{S^c}) \ge \Omega(\beta) - \Omega(\beta_S).$$

But we will also need to lower bound $\Omega(\beta) - \Omega(\beta_S)$ by another norm evaluated at $\beta_{S^c}$. This is motivated by relaxing the following decomposability property of the $\ell_1$-norm:

$$\|\beta\|_1 = \|\beta_S\|_1 + \|\beta_{S^c}\|_1, \quad \forall \text{ sets } S \subset \{1,\dots,p\} \text{ and all } \beta \in \mathbb{R}^p.$$

This decomposability property is used to get oracle inequalities for the LASSO. But we can relax this property, and introduce weakly decomposable norms.

Definition 1 (Weak decomposability). A norm $\Omega$ in $\mathbb{R}^p$ is called weakly decomposable for an index set $S \subset \{1,\dots,p\}$, if there exists a norm $\Omega^{S^c}$ on $\mathbb{R}^{p-|S|}$ such that

$$\forall \beta \in \mathbb{R}^p \quad \Omega(\beta) \ge \Omega(\beta_S) + \Omega^{S^c}(\beta_{S^c}).$$

Furthermore we call a set $S$ allowed if $\Omega$ is a weakly decomposable norm for this set.

Remark. In order to get a good oracle bound, we will choose the norm $\Omega^{S^c}$ as large as possible. We will also choose the allowed sets $S$ in such a way as to reflect the active set $S_0$. Otherwise we would of course be able to choose as a trivial example the empty set $S = \emptyset$.
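The definition can be explored numerically. The sketch below checks the inequality $\Omega(\beta) \ge \Omega(\beta_S) + \Omega^{S^c}(\beta_{S^c})$ with the candidate choice $\Omega^{S^c} = \Omega$ for two norms: the $\ell_1$-norm with an arbitrary set $S$, and a group norm with weights $\sqrt{|G_j|}$ where $S$ is a union of groups. It also shows that, for this particular candidate $\Omega^{S^c}$, the inequality fails when $S$ cuts through a group. The group structure, the zero-padding convention and all names are assumptions made for illustration.

```python
import math
import random


def l1_norm(v):
    return sum(abs(t) for t in v)


def group_norm(beta, groups):
    """sum_j sqrt(|G_j|) * ||beta_{G_j}||_2 (weights sqrt(|G_j|) are an assumption)."""
    return sum(math.sqrt(len(G)) * math.sqrt(sum(beta[j] ** 2 for j in G)) for G in groups)


def restrict(beta, S):
    """beta_S as a zero-padded vector."""
    return [b if j in S else 0.0 for j, b in enumerate(beta)]


p = 6
groups = [[0, 1], [2], [3, 4, 5]]
allowed_S = {0, 1, 2}          # union of the first two groups
complement = set(range(p)) - allowed_S

random.seed(6)
max_abs_gap_l1 = 0.0
max_abs_gap_group = 0.0
for _ in range(500):
    beta = [random.gauss(0, 1) for _ in range(p)]
    b_S, b_Sc = restrict(beta, allowed_S), restrict(beta, complement)
    max_abs_gap_l1 = max(max_abs_gap_l1,
                         abs(l1_norm(beta) - l1_norm(b_S) - l1_norm(b_Sc)))
    max_abs_gap_group = max(max_abs_gap_group,
                            abs(group_norm(beta, groups)
                                - group_norm(b_S, groups) - group_norm(b_Sc, groups)))

# A set that splits the first group: with Omega^{S^c} = Omega the inequality fails.
bad_S = {0, 3}
beta = [1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
gap_bad = (group_norm(beta, groups)
           - group_norm(restrict(beta, bad_S), groups)
           - group_norm(restrict(beta, set(range(p)) - bad_S), groups))
```

In the two allowed cases the inequality holds with equality (the gaps are zero up to rounding), while `gap_bad` is negative. This only rules out the candidate $\Omega^{S^c} = \Omega$ for the bad set, not every possible norm on the complement.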

Now that we have introduced the two main assumptions, we can introduce other definitions and concepts also used in van de Geer [26].

Definition 2. For $S$ an allowed set of a weakly decomposable norm $\Omega$, and $L > 0$ a constant, the $\Omega$-eigenvalue is defined as

$$\delta_\Omega(L, S) := \min\left\{ \|X\beta_S - X\beta_{S^c}\|_n : \Omega(\beta_S) = 1,\ \Omega^{S^c}(\beta_{S^c}) \le L \right\}.$$

Then the $\Omega$-effective sparsity is defined as

$$\Gamma_\Omega^2(L, S) := \frac{1}{\delta_\Omega^2(L, S)}.$$

The $\Omega$-eigenvalue is the distance between the two sets (van de Geer and Lederer [28]) $\{X\beta_S : \Omega(\beta_S) = 1\}$ and $\{X\beta_{S^c} : \Omega^{S^c}(\beta_{S^c}) \le L\}$, see Figure 2.

The additional discussion about these definitions will follow after the main theorem. The $\Omega$-eigenvalue generalizes the compatibility constant (van de Geer [25]).

For the proof of the main theorem we need some small lemmas. For any vector $\beta$ the $(L, S)$-cone condition for a norm $\Omega$ is satisfied if $\Omega^{S^c}(\beta_{S^c}) \le L\,\Omega(\beta_S)$, with $L > 0$ a constant and $S$ an allowed set.

The proof of Lemma 3 can be found in van de Geer [26]. It shows the connection between the $(L, S)$-cone condition and the $\Omega$-eigenvalue. We bound $\Omega(\beta_S)$ by a multiple of $\|X\beta\|_n$.

Lemma 3. Let $S$ be an allowed set of a weakly decomposable norm $\Omega$ and $L > 0$ a constant. Then we have that the $\Omega$-eigenvalue is of the following form:

$$\delta_\Omega(L, S) = \min\left\{ \frac{\|X\beta\|_n}{\Omega(\beta_S)} : \beta \text{ satisfies the cone condition and } \beta_S \ne 0 \right\}.$$

We have $\Omega(\beta_S) \le \Gamma_\Omega(L, S)\|X\beta\|_n$.

We will also need a lower and an upper bound for $\|\hat\epsilon\|_n$, as already mentioned in Assumption I. The next Lemma 4 gives such bounds.

Lemma 4. Suppose that Assumption I holds true. Then

$$1 + f \ge \frac{\|\hat\epsilon\|_n}{\|\epsilon\|_n} \ge \frac{1 - \frac{\lambda^0}{\lambda}(1 + 2f)}{f + 2} > 0.$$

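Proof. The upper bound is obtained by the definition of the estimator

$$\|Y - X\hat\beta\|_n + \lambda\Omega(\hat\beta) \le \|Y - X\beta^0\|_n + \lambda\Omega(\beta^0).$$

Therefore we get

$$\|\hat\epsilon\|_n \le \|\epsilon\|_n + \lambda\Omega(\beta^0).$$

Dividing by $\|\epsilon\|_n$ and by the definition of $f$ we get the desired upper bound. The main idea for the lower bound is to use the triangle inequality

$$\|\hat\epsilon\|_n = \|\epsilon - X(\hat\beta - \beta^0)\|_n \ge \|\epsilon\|_n - \|X(\hat\beta - \beta^0)\|_n,$$

and then upper bound $\|X(\hat\beta - \beta^0)\|_n$. With Lemma 2 we get an upper bound for $\|X(\hat\beta - \beta^0)\|_n^2$,

$$\begin{aligned}
\|X(\hat\beta - \beta^0)\|_n^2 &\le \epsilon^T X(\hat\beta - \beta^0)/n + \lambda\|\hat\epsilon\|_n\left(\Omega(\beta^0) - \Omega(\hat\beta)\right) \\
&\le \lambda^0\|\epsilon\|_n\Omega(\hat\beta - \beta^0) + \lambda\|\hat\epsilon\|_n\left(\Omega(\beta^0) - \Omega(\hat\beta)\right) \\
&\le \lambda^0\|\epsilon\|_n\left(\Omega(\hat\beta) + \Omega(\beta^0)\right) + \lambda\|\hat\epsilon\|_n\left(\Omega(\beta^0) - \Omega(\hat\beta)\right) \\
&\le \lambda^0\|\epsilon\|_n\Omega(\hat\beta) + \Omega(\beta^0)\left(\lambda^0\|\epsilon\|_n + \lambda\|\hat\epsilon\|_n\right).
\end{aligned}$$

In the second line we used the definition of the dual norm, and the Cauchy-Schwarz inequality. Again by the definition of the estimator we have

$$\Omega(\hat\beta) \le \frac{\|\epsilon\|_n}{\lambda} + \Omega(\beta^0).$$

And we are left with

$$\|X(\hat\beta - \beta^0)\|_n^2 \le \|\epsilon\|_n^2\left( \frac{\lambda^0}{\lambda} + 2\frac{\lambda^0\Omega(\beta^0)}{\|\epsilon\|_n} + \frac{\lambda\Omega(\beta^0)}{\|\epsilon\|_n}\cdot\frac{\|\hat\epsilon\|_n}{\|\epsilon\|_n} \right).$$

By the definition of $f$ we get

$$\|X(\hat\beta - \beta^0)\|_n \le \|\epsilon\|_n\sqrt{ \frac{\lambda^0}{\lambda} + 2\frac{\lambda^0}{\lambda}f + f\frac{\|\hat\epsilon\|_n}{\|\epsilon\|_n} }.$$

Now we get

$$\|\hat\epsilon\|_n \ge \|\epsilon\|_n - \|X(\hat\beta - \beta^0)\|_n \ge \|\epsilon\|_n - \|\epsilon\|_n\sqrt{ \frac{\lambda^0}{\lambda} + 2\frac{\lambda^0}{\lambda}f + f\frac{\|\hat\epsilon\|_n}{\|\epsilon\|_n} }. \tag{3.1}$$

Let us rearrange equation (3.1) further in the case $\|\hat\epsilon\|_n/\|\epsilon\|_n < 1$:

$$\begin{aligned}
\frac{\lambda^0}{\lambda} + 2\frac{\lambda^0}{\lambda}f + f\frac{\|\hat\epsilon\|_n}{\|\epsilon\|_n} &\ge \left(1 - \frac{\|\hat\epsilon\|_n}{\|\epsilon\|_n}\right)^2, \\
f\frac{\|\hat\epsilon\|_n}{\|\epsilon\|_n} &\ge 1 - 2\frac{\|\hat\epsilon\|_n}{\|\epsilon\|_n} + \frac{\|\hat\epsilon\|_n^2}{\|\epsilon\|_n^2} - \frac{\lambda^0}{\lambda}(1 + 2f), \\
(f + 2)\frac{\|\hat\epsilon\|_n}{\|\epsilon\|_n} &\ge 1 - \frac{\lambda^0}{\lambda}(1 + 2f), \\
\frac{\|\hat\epsilon\|_n}{\|\epsilon\|_n} &\ge \frac{1 - \frac{\lambda^0}{\lambda}(1 + 2f)}{f + 2} > 0 \quad \text{by Assumption I}.
\end{aligned}$$

On the other hand, if $\|\hat\epsilon\|_n/\|\epsilon\|_n \ge 1$, we already have a lower bound which is bigger than $\frac{1 - \frac{\lambda^0}{\lambda}(1 + 2f)}{f + 2}$. □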

Finally we are able to present the main theorem. This theorem gives sharp oracle inequalities on the prediction error expressed in the $\ell_2$-norm, and the estimation error expressed in the $\Omega$ and $\Omega^{S^c}$ norms.

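Remark. Let us first briefly remark that in Theorem 1 we need to assure that $\lambda^* - \lambda^m > 0$. The assumption $\lambda^m/\lambda < 1/a$, with $a$ chosen as in Theorem 1, together with the fact that $\lambda^0 \le \lambda^m$ leads to the desired inequality

$$\frac{\lambda^*}{\lambda} = \frac{1 - \frac{\lambda^0}{\lambda}(1 + 2f)}{f + 2} \ge \frac{1 - \frac{\lambda^m}{\lambda}(1 + 2f)}{f + 2} > \frac{\lambda^m}{\lambda}.$$

Theorem 1. Assume that $0 \le \delta < 1$, and also that $a\lambda^m < \lambda$, with the constant $a = 3(1 + f)$. We invoke also Assumption I (overfitting) and Assumption II (weak decomposability) for $S$ and $\Omega$. Here the allowed set $S$ is chosen such that the active set $S_\beta := \mathrm{supp}(\beta)$ is a subset of $S$. Then it holds true that

$$\begin{aligned}
\|X(\hat\beta - \beta^0)\|_n^2 &+ 2\delta\|\epsilon\|_n\left[ (\lambda^* + \lambda^m)\Omega(\hat\beta_S - \beta) + (\lambda^* - \lambda^m)\Omega^{S^c}(\hat\beta_{S^c}) \right] \\
&\le \|X(\beta - \beta^0)\|_n^2 + \|\epsilon\|_n^2\left[ (1 + \delta)(\tilde\lambda + \lambda^m) \right]^2\Gamma_\Omega^2(L_S, S),
\end{aligned} \tag{3.2}$$

with $L_S := \frac{\tilde\lambda + \lambda^m}{\lambda^* - \lambda^m}\cdot\frac{1 + \delta}{1 - \delta}$ and

$$\lambda^* := \lambda\,\frac{1 - \frac{\lambda^0}{\lambda}(1 + 2f)}{f + 2}, \qquad \tilde\lambda := \lambda(1 + f).$$

Furthermore we get the two oracle inequalities

$$\|X(\hat\beta - \beta^0)\|_n^2 \le \|X(\beta_* - \beta^0)\|_n^2 + \|\epsilon\|_n^2(1 + \delta)^2(\tilde\lambda + \lambda^m)^2\cdot\Gamma_\Omega^2(L_{S_*}, S_*),$$

$$\Omega(\hat\beta_{S_*} - \beta_*) + \Omega^{S_*^c}(\hat\beta_{S_*^c}) \le \frac{1}{2\delta\|\epsilon\|_n}\cdot\frac{\|X(\beta_* - \beta^0)\|_n^2}{\lambda^* - \lambda^m} + \frac{(1 + \delta)^2\|\epsilon\|_n}{2\delta}\cdot\frac{(\tilde\lambda + \lambda^m)^2}{\lambda^* - \lambda^m}\cdot\Gamma_\Omega^2(L_{S_*}, S_*).$$

For all fixed allowed sets $S$ define

$$\beta_*(S) := \arg\min_{\beta:\ \mathrm{supp}(\beta) \subseteq S}\left\{ \|X(\beta - \beta^0)\|_n^2 + \|\epsilon\|_n^2\left[ (1 + \delta)(\tilde\lambda + \lambda^m) \right]^2\Gamma_\Omega^2(L_S, S) \right\}.$$

Then $S_*$ is defined as

$$S_* := \arg\min_{S \text{ allowed}}\left\{ \|X(\beta_*(S) - \beta^0)\|_n^2 + \|\epsilon\|_n^2\left[ (1 + \delta)(\tilde\lambda + \lambda^m) \right]^2\Gamma_\Omega^2(L_S, S) \right\}, \tag{3.3}$$

$$\beta_* := \beta_*(S_*). \tag{3.4}$$

With this choice, $\beta_*$ attains the minimal right hand side of the oracle inequality (3.2). An important special case of equation (3.2) is to choose $\beta = \beta^0$ with $S \supseteq S_0$ allowed. The term $\|X(\beta - \beta^0)\|_n^2$ vanishes in this case and only the $\Omega$-effective sparsity term remains for the upper bound. But it is not obvious in which cases and whether $\beta_*$ leads to a substantially lower bound than $\beta^0$.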

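Proof. Let $\beta \in \mathbb{R}^p$ and let $S$ be an allowed set containing the active set of $\beta$. We need to distinguish 2 cases. The second case is the more substantial one.

Case 1: Assume that

$$\langle X(\hat\beta - \beta^0), X(\hat\beta - \beta)\rangle_n \le -\delta\|\epsilon\|_n\left[ (\lambda^* + \lambda^m)\Omega(\hat\beta_S - \beta) + (\lambda^* - \lambda^m)\Omega^{S^c}(\hat\beta_{S^c}) \right].$$

Here $\langle u, v\rangle_n := v^T u/n$, for any two vectors $u, v \in \mathbb{R}^n$. In this case we can simply use the following calculations to verify the theorem.

$$\begin{aligned}
&\|X(\hat\beta - \beta^0)\|_n^2 - \|X(\beta - \beta^0)\|_n^2 + 2\delta\|\epsilon\|_n\left[ (\lambda^* + \lambda^m)\Omega(\hat\beta_S - \beta) + (\lambda^* - \lambda^m)\Omega^{S^c}(\hat\beta_{S^c}) \right] \\
&\quad = 2\langle X(\hat\beta - \beta^0), X(\hat\beta - \beta)\rangle_n - \|X(\hat\beta - \beta)\|_n^2 + 2\delta\|\epsilon\|_n\left[ (\lambda^* + \lambda^m)\Omega(\hat\beta_S - \beta) + (\lambda^* - \lambda^m)\Omega^{S^c}(\hat\beta_{S^c}) \right] \\
&\quad \le -\|X(\hat\beta - \beta)\|_n^2 \\
&\quad \le 0.
\end{aligned}$$

Now we can turn to the more important case.

Case 2: Assume that

$$\langle X(\hat\beta - \beta^0), X(\hat\beta - \beta)\rangle_n \ge -\delta\|\epsilon\|_n\left[ (\lambda^* + \lambda^m)\Omega(\hat\beta_S - \beta) + (\lambda^* - \lambda^m)\Omega^{S^c}(\hat\beta_{S^c}) \right].$$

We can reformulate Lemma 2 with $Y - X\hat\beta = X(\beta^0 - \hat\beta) + \epsilon$, then we get:

$$\frac{\left(X(\beta^0 - \hat\beta) + \epsilon\right)^T X(\beta - \hat\beta)}{n\|\hat\epsilon\|_n} + \lambda\Omega(\hat\beta) \le \lambda\Omega(\beta).$$

This is equivalent to

$$\langle X(\hat\beta - \beta^0), X(\hat\beta - \beta)\rangle_n + \|\hat\epsilon\|_n\lambda\Omega(\hat\beta) \le \langle\epsilon, X(\hat\beta - \beta)\rangle_n + \|\hat\epsilon\|_n\lambda\Omega(\beta). \tag{3.5}$$

By the definition of the dual norm and the generalized Cauchy-Schwarz inequality we have

$$\begin{aligned}
\langle\epsilon, X(\hat\beta - \beta)\rangle_n &\le \|\epsilon\|_n\left( \lambda^S\Omega(\hat\beta_S - \beta) + \lambda^{S^c}\Omega^{S^c}(\hat\beta_{S^c}) \right) \\
&\le \|\epsilon\|_n\left( \lambda^m\Omega(\hat\beta_S - \beta) + \lambda^m\Omega^{S^c}(\hat\beta_{S^c}) \right).
\end{aligned}$$

Inserting this inequality into (3.5) we get

$$\langle X(\hat\beta - \beta^0), X(\hat\beta - \beta)\rangle_n + \|\hat\epsilon\|_n\lambda\Omega(\hat\beta) \le \|\epsilon\|_n\left( \lambda^m\Omega(\hat\beta_S - \beta) + \lambda^m\Omega^{S^c}(\hat\beta_{S^c}) \right) + \|\hat\epsilon\|_n\lambda\Omega(\beta). \tag{3.6}$$

Then by the weak decomposability and the triangle inequality in (3.6)

$$\begin{aligned}
&\langle X(\hat\beta - \beta^0), X(\hat\beta - \beta)\rangle_n + \|\hat\epsilon\|_n\lambda\left( \Omega(\hat\beta_S) + \Omega^{S^c}(\hat\beta_{S^c}) \right) \\
&\quad \le \|\epsilon\|_n\left( \lambda^m\Omega(\hat\beta_S - \beta) + \lambda^m\Omega^{S^c}(\hat\beta_{S^c}) \right) + \|\hat\epsilon\|_n\lambda\left( \Omega(\hat\beta_S - \beta) + \Omega(\hat\beta_S) \right).
\end{aligned} \tag{3.7}$$

By inserting the assumption of case 2

$$\langle X(\hat\beta - \beta^0), X(\hat\beta - \beta)\rangle_n \ge -\delta\|\epsilon\|_n\left[ (\lambda^* + \lambda^m)\Omega(\hat\beta_S - \beta) + (\lambda^* - \lambda^m)\Omega^{S^c}(\hat\beta_{S^c}) \right]$$

into (3.7), and using $\lambda^* \le \tilde\lambda$ on the right hand side, we get

$$\left( \lambda\|\hat\epsilon\|_n - \lambda^m\|\epsilon\|_n - \delta\|\epsilon\|_n(\lambda^* - \lambda^m) \right)\Omega^{S^c}(\hat\beta_{S^c}) \le \left( \lambda\|\hat\epsilon\|_n + \lambda^m\|\epsilon\|_n + \delta\|\epsilon\|_n(\tilde\lambda + \lambda^m) \right)\Omega(\hat\beta_S - \beta).$$

By the assumption $a\lambda^m < \lambda$ we have that $\lambda^* > \lambda^m$ (see the Remark preceding Theorem 1) and therefore

$$\Omega^{S^c}(\hat\beta_{S^c}) \le \left( \frac{\tilde\lambda + \lambda^m}{\lambda^* - \lambda^m} \right)\cdot\frac{1 + \delta}{1 - \delta}\cdot\Omega(\hat\beta_S - \beta).$$

We have applied Lemma 4 in the last step, in order to replace the estimate $\|\hat\epsilon\|_n$ with $\|\epsilon\|_n$. By the definition of $L_S$ we have

$$\Omega^{S^c}(\hat\beta_{S^c}) \le L_S\,\Omega(\hat\beta_S - \beta). \tag{3.8}$$

Therefore with Lemma 3 we get

$$\Omega(\hat\beta_S - \beta) \le \Gamma_\Omega(L_S, S)\|X(\hat\beta - \beta)\|_n. \tag{3.9}$$

Inserting (3.9) into (3.7), together with Lemma 4 and $\delta < 1$, we get

$$\begin{aligned}
&\langle X(\hat\beta - \beta^0), X(\hat\beta - \beta)\rangle_n + \delta\|\epsilon\|_n(\lambda^* - \lambda^m)\Omega^{S^c}(\hat\beta_{S^c}) \\
&\quad \le (1 + \delta - \delta)\|\epsilon\|_n\left( \lambda\|\hat\epsilon\|_n/\|\epsilon\|_n + \lambda^m \right)\Omega(\hat\beta_S - \beta) \\
&\quad \le (1 + \delta)\|\epsilon\|_n(\tilde\lambda + \lambda^m)\Gamma_\Omega(L_S, S)\|X(\hat\beta - \beta)\|_n - \delta\|\epsilon\|_n(\lambda^* + \lambda^m)\Omega(\hat\beta_S - \beta).
\end{aligned}$$

Because $\forall u, v \in \mathbb{R}$, $0 \le (u - v)^2$, it holds true that $uv \le \frac{1}{2}(u^2 + v^2)$.

Therefore with $u = (1 + \delta)\|\epsilon\|_n(\tilde\lambda + \lambda^m)\Gamma_\Omega(L_S, S)$ and $v = \|X(\hat\beta - \beta)\|_n$ we have

$$\begin{aligned}
&\langle X(\hat\beta - \beta^0), X(\hat\beta - \beta)\rangle_n + \delta\|\epsilon\|_n(\lambda^* - \lambda^m)\Omega^{S^c}(\hat\beta_{S^c}) + \delta\|\epsilon\|_n(\lambda^* + \lambda^m)\Omega(\hat\beta_S - \beta) \\
&\quad \le \frac{1}{2}(1 + \delta)^2\|\epsilon\|_n^2(\tilde\lambda + \lambda^m)^2\Gamma_\Omega^2(L_S, S) + \frac{1}{2}\|X(\hat\beta - \beta)\|_n^2.
\end{aligned}$$

Since

$$2\langle X(\hat\beta - \beta^0), X(\hat\beta - \beta)\rangle_n = \|X(\hat\beta - \beta^0)\|_n^2 - \|X(\beta - \beta^0)\|_n^2 + \|X(\hat\beta - \beta)\|_n^2,$$

we get

$$\begin{aligned}
&\|X(\hat\beta - \beta^0)\|_n^2 + 2\delta\|\epsilon\|_n\left( (\lambda^* - \lambda^m)\Omega^{S^c}(\hat\beta_{S^c}) + (\lambda^* + \lambda^m)\Omega(\hat\beta_S - \beta) \right) \\
&\quad \le (1 + \delta)^2\|\epsilon\|_n^2(\tilde\lambda + \lambda^m)^2\Gamma_\Omega^2(L_S, S) + \|X(\beta - \beta^0)\|_n^2.
\end{aligned} \tag{3.10}$$

This gives the sharp oracle inequality. The two oracle inequalities mentioned are just a split up version of inequality (3.10), where for the second oracle inequality we need to see that $\lambda^* - \lambda^m \le \lambda^* + \lambda^m$. □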

Remark that the sharpness in the oracle inequality of Theorem 1 is the constant one in front of the term $\|X(\beta - \beta^0)\|_n^2$. Because we measure a vector on $S_*$ by $\Omega$ and on the inactive set $S_*^c$ by the norm $\Omega^{S_*^c}$, we take here $\Omega(\hat\beta_{S_*} - \beta_*)$ and $\Omega^{S_*^c}(\hat\beta_{S_*^c})$ as estimation errors.

If we choose $\lambda$ of the same order as $\lambda^m$ (i.e. $a\lambda = \lambda^m$, with $a > 0$ a constant), then we can simplify the oracle inequalities. This is comparable to the oracle inequalities for the LASSO, see for example Bickel et al. [5], Bunea et al. [8], Bunea et al. [9], van de Geer [25]; further references can be found in Bühlmann and van de Geer [11].

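Corollary 1. Take $\lambda$ of the order of $\lambda^m$ (i.e. $\lambda^m = C\lambda$, with $0 < C < \frac{1}{3(f + 1)}$ a constant). Invoke the same assumptions as in Theorem 1. Here we also use the same notation of an optimal $\beta_*$ with $S_*$ as in equation (3.3) and (3.4). Then we have

$$\|X(\hat\beta - \beta^0)\|_n^2 \le \|X(\beta_* - \beta^0)\|_n^2 + C_1\lambda^2\cdot\Gamma_\Omega^2(L_{S_*}, S_*),$$

$$\Omega(\hat\beta_{S_*} - \beta_*) + \Omega^{S_*^c}(\hat\beta_{S_*^c}) \le C_2\left( \frac{\|X(\beta_* - \beta^0)\|_n^2}{\lambda} + C_1\lambda\cdot\Gamma_\Omega^2(L_{S_*}, S_*) \right).$$

Here $C_1$ and $C_2$ are the constants:

$C_1 := (1 + \delta)^2\cdot\|\epsilon\|_n^2(f + C + 1)^2,$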

C 2 ■= —-- 

2511elln ^1 - 2 C '(1 + 2 /) - C 
(3.11)

(3.12) 







First let us explain some of the parts of Theorem 1 in more detail. We can 
also study what happens to the bound if we additionally assume Gaussian 
errors, see Proposition 1. 

On the two parts of the oracle bound: 

The oracle bound is a trade-off between two parts, which we will discuss now. Let us first remember that if we set $\beta = \beta^0$ in the sharp oracle bound, only the term with the effective sparsity will not vanish on the right hand side of the bound. But due to the minimization over $\beta$ in the definition of $\beta_*$ we might even do better than that bound.

The first part, consisting of minimizing $\|X(\beta - \beta^0)\|_n^2$, can be thought of as the error made due to approximation, hence we call it the approximation error. If we fix the support $S$, which can be thought of as being determined by the second part, then minimizing $\|X(\beta - \beta^0)\|_n^2$ is just a projection onto the subspace spanned by $S$, see Figure 1. So if $S$ has a structure similar to that of the true unknown support $S_0$ of $\beta^0$, this will be small.



Figure 1: approximation error 

The second part, containing $\Gamma_\Omega^2(L_S, S)$, is due to estimation errors. There, minimizing over $\beta$ will affect the set $S$. As already mentioned, it is one over the squared distance between the two sets $\{X\beta_S : \Omega(\beta_S) = 1\}$ and $\{X\beta_{S^c} : \Omega^{S^c}(\beta_{S^c}) \le L\}$. Figure 2 shows this distance. This means that if the vectors in $X_S$ and $X_{S^c}$ show a high correlation, the distance will shrink and the $\Omega$-effective sparsity will blow up, which we try to avoid. This distance depends also on the two chosen sparsity norms $\Omega$ and $\Omega^{S^c}$. It is crucial to choose norms that reflect the true underlying sparsity in order to get a good bound. Also the constant $L_S$ should be small.
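The effect of correlation on the effective sparsity can be illustrated in the smallest possible case. The sketch below takes two normalized columns with empirical correlation `rho`, the $\ell_1$-norm on both $S = \{1\}$ and its complement, and evaluates the distance $\min\{\|x_1 - b\,x_2\|_n : |b| \le L\}$ by a grid search; by symmetry only $\beta_1 = 1$ needs to be considered. The construction of the columns, the grid resolution and the value of `L` are assumptions chosen for illustration.

```python
import math


def norm_n_sq(v):
    """Squared empirical norm ||v||_n^2 = sum(v_i^2) / n."""
    return sum(t * t for t in v) / len(v)


def two_columns(rho):
    """Two vectors in R^2 with ||x1||_n = ||x2||_n = 1 and <x1, x2>_n = rho."""
    s = math.sqrt(2.0)
    return [s, 0.0], [s * rho, s * math.sqrt(1.0 - rho * rho)]


def effective_sparsity(rho, L=3.0, grid=20001):
    """1 / delta^2, with delta^2 = min over |b| <= L of ||x1 - b*x2||_n^2 (grid search)."""
    x1, x2 = two_columns(rho)
    best = float("inf")
    for k in range(grid):
        b = -L + 2.0 * L * k / (grid - 1)
        best = min(best, norm_n_sq([u - b * v for u, v in zip(x1, x2)]))
    return 1.0 / best


gammas = {rho: effective_sparsity(rho) for rho in (0.0, 0.5, 0.99)}
```

In this toy case the squared distance is $1 - 2b\rho + b^2$, minimized at $b = \rho$ when $|\rho| \le L$, so the effective sparsity is $1/(1 - \rho^2)$: it equals 1 for orthogonal columns and grows without bound as the correlation approaches one.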
On the randomness of the oracle bound:

Until now, the bound still contains some random parts, for example in $\lambda^m$. In order to get rid of that random part we need to introduce the following sets

$$T := \left\{ \max\left( \frac{\Omega^*\!\left((\epsilon^T X)_W\right)}{n\|\epsilon\|_n},\ \frac{\Omega^{W^c *}\!\left((\epsilon^T X)_{W^c}\right)}{n\|\epsilon\|_n} \right) \le d \right\},$$

where $d \in \mathbb{R}$, and $W$ is any allowed set. We need to choose the constant $d$ in such a way that we have a high probability for this set. In other words, we try to bound the random part by a non-random constant with a very high probability. In order to do this we need some assumptions on the errors. Here we assume Gaussian errors. Let us also remark that $\frac{\Omega^*\left((\epsilon^T X)_W\right)}{n\|\epsilon\|_n}$ is normalized by $\|\epsilon\|_n$. This normalization occurs due to the special form of the Karush-Kuhn-Tucker conditions. Thus the square root of the residual sum of squared errors is responsible for this normalization. In fact, this normalization is the main reason why $\lambda$ does not contain the unknown variance. So the square root part of the estimator makes the estimator pivotal. Now in the case of Gaussian errors, we can use the concentration inequality from Theorem 5.8 in Boucheron et al. [7] and get the following proposition. Define first:

$$Z_1 := \frac{\Omega^*\!\left((\epsilon^T X)_W\right)}{n\|\epsilon\|_n}, \qquad V_1 := Z_1\|\epsilon\|_n/\sigma,$$

$$Z_2 := \frac{\Omega^{W^c *}\!\left((\epsilon^T X)_{W^c}\right)}{n\|\epsilon\|_n}, \qquad V_2 := Z_2\|\epsilon\|_n/\sigma,$$

$$Z := \max(Z_1, Z_2), \qquad V := \max(V_1, V_2).$$
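The normalization by $\|\epsilon\|_n$ can be seen at work in a short experiment. The sketch below computes, for the $\ell_1$ case (dual norm $\ell_\infty$) and with $W$ taken to be all columns, the statistic $\|\epsilon^T X\|_\infty/(n\|\epsilon\|_n)$ for the same noise vector at two very different scales, next to its unnormalized counterpart. The design, the noise draw and the two scale factors are illustrative assumptions.

```python
import math
import random


def norm_n(v):
    return math.sqrt(sum(t * t for t in v) / len(v))


def sup_corr(eps, X):
    """||eps^T X||_inf / n for a design X stored as a list of rows."""
    n, p = len(eps), len(X[0])
    return max(abs(sum(eps[i] * X[i][j] for i in range(n))) for j in range(p)) / n


def z_stat(eps, X):
    """Normalised statistic ||eps^T X||_inf / (n ||eps||_n)."""
    return sup_corr(eps, X) / norm_n(eps)


random.seed(4)
n, p = 30, 5
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
base = [random.gauss(0, 1) for _ in range(n)]

eps_small = [0.1 * e for e in base]    # noise with standard deviation 0.1
eps_large = [25.0 * e for e in base]   # the same noise pattern at standard deviation 25

z_small, z_large = z_stat(eps_small, X), z_stat(eps_large, X)
unnormalised_ratio = sup_corr(eps_large, X) / sup_corr(eps_small, X)
```

The normalized statistic is identical at both scales, while the unnormalized one changes by the factor 250 between them; this is the sense in which a threshold $d$ for $Z$ can be chosen without knowing $\sigma$.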


Proposition 1. Suppose that we have i.i.d. Gaussian errors $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$, and that the following normalization $(X^T X/n)_{i,i} = 1$, $\forall i \in \{1,\dots,p\}$ holds true. Let $B := \{z \in \mathbb{R}^p : \Omega(z) \le 1\}$ be the unit $\Omega$-ball, and $B_2 := \sup_{b \in B} b^T b$ an $\Omega$-ball and $\ell_2$-ball comparison. Then we have for all $d > EV$ and $\Delta > 1$

$$P(T) \ge 1 - 2e^{-\frac{(d - EV)^2\Delta^2}{2B_2/n}} - 2e^{-\frac{n}{4}(1 - \Delta^2)^2}.$$

Proof. Let us define := 


sup E (calculate it 
b&B V / 




sup Var 
b&B 


{e^X)wh 


na 



= sup 
b&B 


n 

E fe^E^-Var 

.w&W i=l 



1 




= sup b^Y • bw/n E ^li/n 

bGB ^ 

= sup 6^ • bw/n < B 2 /n. 
beB 


(3.13) 


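These calculations hold true as well for $W^c$ instead of $W$. Furthermore, in the subsequent inequalities we can substitute $W$ with $W^c$ and use $Z_2$, $V_2$ instead of $Z_1$, $V_1$ to get an analogous result. We have $\frac{(\epsilon^T X)_W b_W}{n\sigma} \sim \mathcal{N}(0, b_W^T b_W/n)$. This is an almost surely continuous centred Gaussian process. Therefore we can apply Theorem 5.8 from Boucheron et al. [7]

$$P(V_1 - EV_1 \ge c) \le e^{-\frac{c^2}{2B_2/n}}. \tag{3.14}$$

Now to get to a probability inequality for $Z_1$ we use the following calculations

$$\begin{aligned}
P(Z_1 - EV_1 \ge d) &\le P\left( \frac{V_1\sigma}{\|\epsilon\|_n} - EV_1 \ge d \ \wedge\ \|\epsilon\|_n \ge \sigma\Delta \right) + P(\|\epsilon\|_n < \sigma\Delta) \\
&\le P(V_1 - EV_1\Delta \ge d\Delta) + P(\|\epsilon\|_n < \sigma\Delta) \\
&\le P(V_1 - EV_1 \ge d\Delta) + P(\|\epsilon\|_n < \sigma\Delta) \\
&\le e^{-\frac{d^2\Delta^2}{2B_2/n}} + P(\|\epsilon\|_n < \sigma\Delta).
\end{aligned} \tag{3.15}$$

The calculations above use the union bound and the fact that a bigger set containing another set has a bigger probability. Furthermore we have applied equations (3.13) and (3.14). Now we are left to give a bound on $P(\|\epsilon\|_n/\sigma < \Delta)$. For this we use the corollary to Lemma 1 from Laurent and Massart [15] together with the fact that $\|\epsilon\|_n/\sigma = \sqrt{R/n}$ with $R = \sum_{i=1}^n (\epsilon_i/\sigma)^2 \sim \chi^2(n)$. We obtain

$$P\left( R \le n - 2\sqrt{nx} \right) \le \exp(-x) \quad \forall x > 0,$$

$$P\left( \frac{\|\epsilon\|_n}{\sigma} \le \Delta \right) \le e^{-\frac{n}{4}(1 - \Delta^2)^2}. \tag{3.16}$$

Combining equations (3.15) and (3.16) finishes the proof:

$$\begin{aligned}
P(T) &= P(\max(Z_1, Z_2) \le d) \\
&= P(Z_1 \le d \cap Z_2 \le d) \\
&\ge P(Z_1 \le d) + P(Z_2 \le d) - 1 \\
&\ge 1 - P(Z_1 \ge d) - P(Z_2 \ge d) \\
&\ge 1 - 2e^{-\frac{(d - EV)^2\Delta^2}{2B_2/n}} - 2e^{-\frac{n}{4}(1 - \Delta^2)^2}.
\end{aligned}$$

□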


So the probability that the event T does not occur decays exponentially. 
This is what we mean by having a very high probability. Therefore we can 

take d = t ■ + E [P] with = 1 — where t = y^log (A) and 

2 g-n /2 < Q, ensure > 0. With this we get 

$$P(T) \ge 1 - \alpha. \tag{3.17}$$

First remark that the term $\epsilon^T/\sigma$ is now of the right scaling, because $\epsilon_i/\sigma \sim \mathcal{N}(0, 1)$. This is the whole point of the square root regularization.

Here $B_2$ can be thought of as comparing the $\Omega$-ball in direction $W$ to the $\ell_2$-ball in direction $W$, because if the norm $\Omega$ is the $\ell_2$-norm, then $B_2 = 1$. Moreover, for every norm there exists a constant $D$ such that for all $\beta$ it holds

$$\|\beta\|_2 \le D\,\Omega(\beta).$$

Therefore the $B_2$ of $\Omega$ satisfies

$$B_2 \le D^2\sup_{b \in B}\Omega(b_W)^2 \le D^2.$$

Thus we can take _ 
What is left to be determined is $E[V]$. In many cases we can use an adjusted version of the main theorem in Maurer and Pontil [17] for Gaussian complexities to obtain this expectation. All the examples below can be calculated in this way. So, in the case of Gaussian errors, we have the following new version of Corollary 1.

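Corollary 2. Take $\lambda = \frac{t}{\Delta}\cdot D\sqrt{\frac{2}{n}} + E[V]$, where $t$, $\Delta$, $V$ and $D$ are defined as above. Invoke the same assumptions as in Theorem 1 and additionally assume Gaussian errors. Use the notation from Corollary 1. Then with probability $1 - \alpha$ the following oracle inequalities hold true

$$\|X(\hat\beta - \beta^0)\|_n^2 \le \|X(\beta_* - \beta^0)\|_n^2 + C_1\lambda^2\cdot\Gamma_\Omega^2(L_{S_*}, S_*),$$

$$\Omega(\hat\beta_{S_*} - \beta_*) + \Omega^{S_*^c}(\hat\beta_{S_*^c}) \le C_2\left( \frac{\|X(\beta_* - \beta^0)\|_n^2}{\lambda} + C_1\lambda\cdot\Gamma_\Omega^2(L_{S_*}, S_*) \right).$$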


Now we still have a $\|\epsilon\|_n^2$ term in the constants (3.11), (3.12) of the oracle inequality. In order to handle this we need Lemma 1 from Laurent and Massart [15], which translates in our case to the probability inequality

$$P\left( \|\epsilon\|_n^2 \le \sigma^2(1 + 2x + 2x^2) \right) \ge 1 - \exp(-n\cdot x^2).$$

Here $x > 0$ is a constant. Therefore $\|\epsilon\|_n^2$ is of the order of $\sigma^2$, except on an event whose probability decays exponentially in $n$. We could also write this in the following form

$$P\left( \|\epsilon\|_n^2 \le \sigma^2\cdot C \right) \ge 1 - \exp\left( -\frac{n}{2}\left( C - \sqrt{2C - 1} \right) \right).$$

Here we can choose any constant $C > 1$ big enough and take the bound $\sigma^2\cdot C$ for $\|\epsilon\|_n^2$ in the oracle inequality. A similar bound can be found in Laurent and Massart [15] for $1/\|\epsilon\|_n^2$. This takes care of the random part in the sharp oracle bound with the Gaussian errors.
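The chi-square tail bound above can be checked by simulation, and the two ways of writing it can be compared algebraically: setting $C = 1 + 2x + 2x^2$ gives $x^2 = (C - \sqrt{2C - 1})/2$. The sketch below does both; the sample size, the value of `x`, the noise level and the number of Monte Carlo trials are arbitrary illustrative choices.

```python
import math
import random


def coverage_frequency(n, x, sigma, trials, rng):
    """Fraction of simulated noise vectors with ||eps||_n^2 <= sigma^2 (1 + 2x + 2x^2)."""
    threshold = sigma ** 2 * (1 + 2 * x + 2 * x * x)
    hits = 0
    for _ in range(trials):
        mean_square = sum(rng.gauss(0, sigma) ** 2 for _ in range(n)) / n
        if mean_square <= threshold:
            hits += 1
    return hits / trials


rng = random.Random(5)
n, x = 50, 0.2
freq = coverage_frequency(n, x, sigma=3.0, trials=2000, rng=rng)
bound = 1 - math.exp(-n * x * x)   # lower bound on the probability

# The reparametrisation C = 1 + 2x + 2x^2 recovers x^2 = (C - sqrt(2C - 1)) / 2.
C = 1 + 2 * x + 2 * x * x
x_squared_from_C = (C - math.sqrt(2 * C - 1)) / 2
```

The simulated frequency lies above the guaranteed level `bound`, which is expected since the inequality is not tight for moderate $n$.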
Figure 3: Pictorial description of how the estimator $\hat\beta$ works, with unit balls of different sparsity inducing norms. Panels: (a) $\ell_1$-norm; (b) group LASSO norm with groups {x}, {y, z}; (c) a third panel labelled "1 > 0.5 > 0.3".

4 Examples


Here we will give some examples of estimators where our sharp oracle inequalities hold. Figure 3 shows the unit balls of some sparsity inducing norms that we will use as examples. In order to give the theoretical $\lambda$ for these examples we will again assume Gaussian errors. Theorem 1 still holds for all the examples even for non-Gaussian errors. Some of the examples will introduce new estimators inspired by methods similar to the square root LASSO.

Square Root LASSO 

First we examine the square root LASSO,

$$\hat\beta_{srL} := \arg\min_{\beta \in \mathbb{R}^p}\left\{ \|Y - X\beta\|_n + \lambda\|\beta\|_1 \right\}.$$

Here we use the $\ell_1$-norm as a sparsity measure. We know that the $\ell_1$-norm has the nice property of being able to set certain unimportant parameters individually to zero. As already mentioned, the $\ell_1$-norm has the following decomposability property for any set $S$

$$\|\beta\|_1 = \|\beta_S\|_1 + \|\beta_{S^c}\|_1, \quad \forall\beta \in \mathbb{R}^p.$$

Therefore we also have weak decomposability for all subsets $S \subset \{1,\dots,p\}$ with $\Omega^{S^c}$ being the $\ell_1$-norm again. Thus Assumption II is fulfilled for all sets $S$ and so we are able to apply Theorem 1.

Furthermore, for the square root LASSO we have $B^2 = 1$. This is because
the $\ell_2$-norm is bounded by the $\ell_1$-norm without any constant. So in order
to get the value of $\lambda$ we need to calculate the expectation of the dual norm
of $\epsilon^T X/(n\sigma)$. The dual norm of $\ell_1$ is the $\ell_\infty$-norm. By Maurer and Pontil [17],
we also have



Therefore the theoretical $\lambda$ for the square root LASSO can be chosen as

$$\lambda = \left(t/\Delta + 2 + \sqrt{\log(|p|)}\right).$$

Even though this theoretical $\lambda$ is very close to being optimal, it is not
optimal, see for example van de Geer [27]. In the special case of the $\ell_1$-norm
penalization, we can simplify Corollary 2:

Corollary 3 (Square Root LASSO). Take $\lambda = \left(t/\Delta + 2 + \sqrt{\log(|p|)}\right)$,
where t > 0 and $\Delta$ > 1 are chosen as in (3.17). Invoke the same assumptions
as in Corollary 2. Then for $\Omega(\cdot) = \|\cdot\|_1$, we have with probability $1 - \alpha$ that
the following oracle inequalities hold true:

\\X0srL - /3°)||^ < ||X(/3* - + CiA2 . Tl{Ls^,S,) 

WsrL -PA\i<C2 + CiA • TULs., . 


Note that in Corollary 3 we have an oracle inequality for the estimation
error $\|\hat\beta_{srL} - \beta^\star\|_1$ in $\ell_1$. This is due to the decomposability of the $\ell_1$-norm.
In other examples we will have the sum of two norms.


Group Square Root LASSO 

In order to set groups of variables simultaneously to zero, and not only
individual variables, we will look at a different sparsity inducing norm, namely
an $\ell_1$-type norm for grouped variables, called the group LASSO norm. The
group square root LASSO was introduced by Bunea et al. [10] as

$$\hat\beta_{gsrL} := \arg\min_{\beta \in \mathbb{R}^p} \left\{ \|Y - X\beta\|_n + \lambda \sum_{j=1}^{g} \sqrt{|G_j|}\, \|\beta_{G_j}\|_2 \right\}.$$

Here g is the total number of groups, and $G_j$ is the set of variables that
are in the jth group. Of course the $\ell_1$-norm is a special case of the group
LASSO norm, when $G_j = \{j\}$ and g = p.

The group LASSO penalty is also weakly decomposable with $\Omega^{S^c} = \Omega$, for
any $S = \bigcup_{j \in J} G_j$, with any $J \subset \{1, \ldots, g\}$. So here the sparsity structure
of the group LASSO norm induces the sets S to be of the same sparsity
structure in order to fulfil Assumption II. Therefore Theorem 1 can also
be applied in this case.

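How do we need to choose the theoretical $\lambda$? For the group LASSO norm
we have $B^2 \le 1$. One can see this due to the fact that
$\sqrt{a_1} + \ldots + \sqrt{a_g} \ge \sqrt{a_1 + \ldots + a_g}$ for g positive constants.
And also $|G_j| \ge 1$ for all groups. Therefore

$$\sum_{j=1}^{g} \sqrt{|G_j|}\, \|\beta_{G_j}\|_2 \ge \sqrt{\sum_{j=1}^{g} \|\beta_{G_j}\|_2^2} = \|\beta\|_2.$$

Remark that the dual norm is $\Omega^*(\beta) = \max_{1 \le j \le g} \|\beta_{G_j}\|_2 / \sqrt{|G_j|}$. With Maurer
and Pontil [17] we have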


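
$$\max\left( E\left[\frac{\Omega^*\left((\epsilon^T X)_{S_\star}\right)}{n\sigma}\right],\; E\left[\frac{\Omega^*\left((\epsilon^T X)_{S_\star^c}\right)}{n\sigma}\right] \right) \le \left(2 + \sqrt{\log(g)}\right).$$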


That is why $\lambda$ can be taken of the following form

$$\lambda = \left(t/\Delta + 2 + \sqrt{\log(g)}\right).$$


We get a corollary for the group square root LASSO similar to
Corollary 3 for the square root LASSO. In the case of the group LASSO,
better results for the theoretical penalty level are available, see for
example Theorem 8.1 in Bühlmann and van de Geer [11]. This takes the
minimal group size into account.
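
The group LASSO norm and its dual norm are easy to evaluate numerically. The sketch below computes $\Omega(\beta) = \sum_j \sqrt{|G_j|}\,\|\beta_{G_j}\|_2$ and $\Omega^*(w) = \max_j \|w_{G_j}\|_2/\sqrt{|G_j|}$, and checks the bound $\Omega(\beta) \ge \|\beta\|_2$ on one example. The partition, the vectors and the function names are hypothetical and only serve as an illustration.

```python
import numpy as np

def group_lasso_norm(beta, groups):
    """Omega(beta) = sum_j sqrt(|G_j|) * ||beta_{G_j}||_2."""
    return float(sum(np.sqrt(len(g)) * np.linalg.norm(beta[g]) for g in groups))

def group_lasso_dual_norm(w, groups):
    """Omega^*(w) = max_j ||w_{G_j}||_2 / sqrt(|G_j|)."""
    return float(max(np.linalg.norm(w[g]) / np.sqrt(len(g)) for g in groups))

# Hypothetical partition of the index set {0, ..., 4} into g = 3 groups.
groups = [[0, 1], [2, 3], [4]]
beta = np.array([3.0, 4.0, 0.0, 0.0, 1.0])
w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])

if __name__ == "__main__":
    print("group norm :", group_lasso_norm(beta, groups))
    print("l2 norm    :", np.linalg.norm(beta))
    print("dual norm  :", group_lasso_dual_norm(w, groups))
```

With singleton groups $G_j = \{j\}$ the first function reduces to the $\ell_1$-norm, in line with the special case mentioned in the text.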

Square Root SLOPE 

Here we introduce a new method called the square root SLOPE estimator,
which is also part of the square root regularization family. Let us thus take
a look at the sorted $\ell_1$-norm with some decreasing sequence
$\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_p > 0$,

$$J_\lambda(\beta) := \lambda_1 |\beta|_{(1)} + \ldots + \lambda_p |\beta|_{(p)}.$$

This was shown to be a norm by Zeng and Figueiredo [29].
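
A minimal sketch of how the sorted $\ell_1$-norm can be evaluated: sort the absolute values of $\beta$ in decreasing order and take the inner product with the decreasing sequence $\lambda_1 \ge \ldots \ge \lambda_p$. The function name and the example values are illustrative assumptions.

```python
import numpy as np

def sorted_l1_norm(beta, lam):
    """J_lambda(beta) = lam_1 |beta|_(1) + ... + lam_p |beta|_(p).

    lam is assumed to be non-increasing and of the same length as beta.
    """
    abs_sorted = np.sort(np.abs(beta))[::-1]   # |beta|_(1) >= ... >= |beta|_(p)
    return float(np.dot(lam, abs_sorted))

if __name__ == "__main__":
    beta = np.array([0.5, -3.0, 1.0])
    lam = np.array([1.0, 0.5, 0.3])
    print(sorted_l1_norm(beta, lam))           # 1*3 + 0.5*1 + 0.3*0.5
```

If all $\lambda_i$ are equal to one, the function returns the $\ell_1$-norm of $\beta$.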

Let $\pi$ be a permutation of $\{1, \ldots, p\}$. The identity permutation is denoted
by id. In order to show weak decomposability for the norm $J_\lambda$ we need the
following lemmas.

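Lemma 5 (Rearrangement Inequality). Let $\beta_1 \ge \ldots \ge \beta_p$ be a decreasing
sequence of non-negative numbers. The sum $\sum_{i=1}^{p} \lambda_i \beta_{\pi(i)}$ is maximized over
all permutations $\pi$ at $\pi = \mathrm{id}$.

Proof. The result is obvious when p = 2. Suppose now that it is true for
sequences of length p − 1. We then prove it for sequences of length p as
follows. Let $\pi$ be an arbitrary permutation with $j := \pi(p)$. Then

$$\sum_{i=1}^{p} \lambda_i \beta_{\pi(i)} = \sum_{i=1}^{p-1} \lambda_i \beta_{\pi(i)} + \lambda_p \beta_j.$$

By induction

$$\sum_{i=1}^{p-1} \lambda_i \beta_{\pi(i)} \le \sum_{i=1}^{j-1} \lambda_i \beta_i + \sum_{i=j+1}^{p} \lambda_{i-1} \beta_i$$
$$= \sum_{i \ne j} \lambda_i \beta_i + \sum_{i=j+1}^{p} (\lambda_{i-1} - \lambda_i) \beta_i$$
$$= \sum_{i=1}^{p} \lambda_i \beta_i + \sum_{i=j+1}^{p} (\lambda_{i-1} - \lambda_i) \beta_i - \lambda_j \beta_j.$$

Hence we have

$$\sum_{i=1}^{p} \lambda_i \beta_{\pi(i)} \le \sum_{i=1}^{p} \lambda_i \beta_i + \sum_{i=j+1}^{p} (\lambda_{i-1} - \lambda_i) \beta_i - (\lambda_j - \lambda_p) \beta_j$$
$$= \sum_{i=1}^{p} \lambda_i \beta_i + \sum_{i=j+1}^{p} (\lambda_{i-1} - \lambda_i) \beta_i - \sum_{i=j+1}^{p} (\lambda_{i-1} - \lambda_i) \beta_j$$
$$= \sum_{i=1}^{p} \lambda_i \beta_i + \sum_{i=j+1}^{p} (\lambda_{i-1} - \lambda_i)(\beta_i - \beta_j).$$

Since $\lambda_{i-1} \ge \lambda_i$ for all $1 < i \le p$ and $\beta_i \le \beta_j$ for all i > j
we know that

$$\sum_{i=j+1}^{p} (\lambda_{i-1} - \lambda_i)(\beta_i - \beta_j) \le 0.$$

□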
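
For small p, the statement of Lemma 5 can be checked by brute force over all permutations. The sketch below does this for one hypothetical pair of decreasing sequences; it is only a numerical illustration of the lemma, not a substitute for the proof.

```python
from itertools import permutations

def weighted_sum(lam, beta, pi):
    """sum_i lam_i * beta_{pi(i)} for a permutation pi given as a tuple of indices."""
    return sum(l * beta[j] for l, j in zip(lam, pi))

def best_permutation_value(lam, beta):
    """Maximum of the weighted sum over all p! permutations (feasible for small p only)."""
    return max(weighted_sum(lam, beta, pi) for pi in permutations(range(len(beta))))

# Hypothetical decreasing sequences with non-negative entries.
lam = [1.0, 0.5, 0.3, 0.1]
beta = [4.0, 3.0, 3.0, 0.5]
identity = tuple(range(len(beta)))

if __name__ == "__main__":
    print("identity:", weighted_sum(lam, beta, identity))
    print("maximum :", best_permutation_value(lam, beta))
```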


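Lemma 6. Let

$$\Omega(\beta) = \sum_{i=1}^{p} \lambda_i |\beta|_{(i)},$$

and

$$\Omega^{S^c}(\beta_{S^c}) = \sum_{l=1}^{r} \lambda_{p-r+l} |\beta|_{(l, S^c)},$$

where r = p − s and $|\beta|_{(1,S^c)} \ge \ldots \ge |\beta|_{(r,S^c)}$ is the ordered sequence in $\beta_{S^c}$.

Then $\Omega(\beta) \ge \Omega(\beta_S) + \Omega^{S^c}(\beta_{S^c})$. Moreover $\Omega^{S^c}$ is the strongest norm among
all $\underline{\Omega}^{S^c}$ for which $\Omega(\beta) \ge \Omega(\beta_S) + \underline{\Omega}^{S^c}(\beta_{S^c})$.

Proof. Without loss of generality assume $\beta_1 \ge \ldots \ge \beta_p \ge 0$. We have

$$\Omega(\beta_S) + \Omega^{S^c}(\beta_{S^c}) = \sum_{i=1}^{p} \lambda_i \beta_{\pi(i)}$$

for a suitable permutation $\pi$. By Lemma 5 it follows that

$$\Omega(\beta_S) + \Omega^{S^c}(\beta_{S^c}) \le \Omega(\beta).$$

To show that $\Omega^{S^c}$ is the strongest norm it is clear we need only to search among
candidates of the form

$$\underline{\Omega}^{S^c}(\beta_{S^c}) = \sum_{l=1}^{r} \underline{\lambda}_{p-r+l}\, \beta_{\pi^{S^c}(l)}$$

where $\{\underline{\lambda}_{p-r+l}\}$ is a decreasing positive sequence and where $\pi^{S^c}(1), \ldots, \pi^{S^c}(r)$
is a permutation of indices in $S^c$.

This is then maximized by ordering the indices in $S^c$ in decreasing order. But
then it follows that the largest norm is obtained by taking $\underline{\lambda}_{p-r+l} = \lambda_{p-r+l}$
for all $l = 1, \ldots, r$. □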
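
The weak decomposability stated in Lemma 6 can also be checked numerically. The sketch below evaluates $\Omega(\beta)$, $\Omega(\beta_S)$ and $\Omega^{S^c}(\beta_{S^c}) = \sum_{l=1}^{r} \lambda_{p-r+l} |\beta|_{(l,S^c)}$ and returns the gap $\Omega(\beta) - \Omega(\beta_S) - \Omega^{S^c}(\beta_{S^c})$, which should be non-negative. Indices are 0-based, S is assumed to contain distinct indices, and all names are illustrative.

```python
import numpy as np

def slope_norm(beta, lam):
    """Sorted l1 norm of a length-p vector with non-increasing weights lam."""
    return float(np.dot(lam, np.sort(np.abs(beta))[::-1]))

def slope_norm_complement(beta, lam, S):
    """Omega^{S^c}(beta_{S^c}): the r = p - |S| smallest weights applied to beta on S^c."""
    p = len(beta)
    mask = np.ones(p, dtype=bool)
    mask[np.asarray(S, dtype=int)] = False
    rest = np.sort(np.abs(beta[mask]))[::-1]        # ordered sequence in beta_{S^c}
    return float(np.dot(lam[p - rest.size:], rest))

def decomposition_gap(beta, lam, S):
    """Omega(beta) - Omega(beta_S) - Omega^{S^c}(beta_{S^c}); non-negative by Lemma 6."""
    idx = np.asarray(S, dtype=int)
    beta_S = np.zeros_like(beta)
    beta_S[idx] = beta[idx]                         # beta_S is zero outside S
    return slope_norm(beta, lam) - slope_norm(beta_S, lam) - slope_norm_complement(beta, lam, S)

if __name__ == "__main__":
    lam = np.array([1.0, 0.5, 0.3])
    beta = np.array([0.2, -2.0, 1.0])
    print(decomposition_gap(beta, lam, [0]))
```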

The SLOPE was introduced by Bogdan et al. [6] in order to better control
the false discovery rate, and is defined as:

$$\hat\beta_{SLOPE} := \arg\min_{\beta \in \mathbb{R}^p} \left\{ \|Y - X\beta\|_n^2 + \lambda J_\lambda(\beta) \right\}.$$

Now we are able to look at the square root SLOPE, which is the estimator
of the form:

$$\hat\beta_{srSLOPE} := \arg\min_{\beta \in \mathbb{R}^p} \left\{ \|Y - X\beta\|_n + \lambda J_\lambda(\beta) \right\}.$$

The square root SLOPE replaces the squared $\ell_2$-norm with an $\ell_2$-norm.
With Theorem 1 we have provided a sharp oracle inequality for this new
estimator, the square root SLOPE.

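For the SLOPE penalty we have $B^2 \le 1/\lambda_p$, if $\lambda_p > 0$. This is because

$$\frac{J_\lambda(\beta)}{\lambda_p} = \frac{\lambda_1}{\lambda_p} |\beta|_{(1)} + \ldots + \frac{\lambda_p}{\lambda_p} |\beta|_{(p)} \ge \sum_{i=1}^{p} |\beta|_{(i)} = \|\beta\|_1 \ge \|\beta\|_2.$$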


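So the bound gets scaled by the smallest $\lambda$. The dual norm of the SLOPE
is by Lemma 1 of Zeng and Figueiredo [30]

$$J_\lambda^*(\beta) = \max_{k = 1, \ldots, p} \left\{ \left( \sum_{j=1}^{k} \lambda_j \right)^{-1} \|\beta^{(k)}\|_1 \right\}.$$

Here $\beta^{(k)} := (\beta_{(1)}, \ldots, \beta_{(k)})^T$ is the vector which contains the k largest
elements of $\beta$ in absolute value.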
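
A small sketch of the dual norm of the sorted $\ell_1$-norm, written as the maximum over k of the sum of the k largest absolute entries divided by $\lambda_1 + \ldots + \lambda_k$, which is the form known from Zeng and Figueiredo. The code also illustrates the resulting bound $|w^T\beta| \le J_\lambda^*(w)\, J_\lambda(\beta)$ on a toy example. Function names and numbers are assumptions made for illustration.

```python
import numpy as np

def slope_norm(beta, lam):
    """J_lambda(beta) with non-increasing weights lam."""
    return float(np.dot(lam, np.sort(np.abs(beta))[::-1]))

def slope_dual_norm(w, lam):
    """max_k (|w|_(1) + ... + |w|_(k)) / (lam_1 + ... + lam_k), assuming lam_1 > 0."""
    abs_sorted = np.sort(np.abs(w))[::-1]
    return float(np.max(np.cumsum(abs_sorted) / np.cumsum(lam)))

if __name__ == "__main__":
    lam = np.array([1.0, 0.5])
    w = np.array([1.0, 1.0])
    beta = np.array([1.0, 1.0])
    # Here the bound is attained: w.beta = 2 and dual * primal = (2 / 1.5) * 1.5.
    print(w @ beta, slope_dual_norm(w, lam) * slope_norm(beta, lam))
```

With all weights equal to one the dual norm reduces to the $\ell_\infty$-norm, matching the square root LASSO case.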
Again by Maurer and Pontil [17] we have


max E 




na 




na 




Here we denote by 7?^ := ^ A- Therefore we can choose A as 


A 



t , 2V2 + I 
\pA V2 


+ Vlog(|i 72 |) 


Let us remark that the asymptotic minimaxity of SLOPE can be found in 
Su and Candes [22]. 


Sparse Group Square Root LASSO 

The sparse group square root LASSO can be defined similarly to the sparse
group LASSO, see Simon et al. [21]. This new method is defined as:

$$\hat\beta_{srSGLASSO} := \arg\min_{\beta \in \mathbb{R}^p} \left\{ \|Y - X\beta\|_n + \lambda \|\beta\|_1 + \eta \sum_{t=1}^{T} \sqrt{|G_t|}\, \|\beta_{G_t}\|_2 \right\},$$

where we have a partition as follows: $G_t \subset \{1, \ldots, p\}$ for all $t \in \{1, \ldots, T\}$,
$\bigcup_{t=1}^{T} G_t = \{1, \ldots, p\}$ and $G_i \cap G_j = \emptyset$ for $i \ne j$. This penalty is again a norm and
it not only chooses sparse groups by the group LASSO penalty, but also
sparsity inside of the groups with the $\ell_1$-norm. Define
$R(\beta) := \lambda \|\beta\|_1 + \eta \sum_{t=1}^{T} \sqrt{|G_t|}\, \|\beta_{G_t}\|_2$ and $R^{S^c}(\beta) := \lambda \|\beta\|_1$.
Then we have weak decomposability for any set S

$$R(\beta_S) + R^{S^c}(\beta_{S^c}) \le R(\beta).$$



This is due to the weak decomposability property of the $\ell_1$-norm and

$$\|\beta_S\|_2 = \sqrt{\sum_{j \in S} \beta_j^2} \le \sqrt{\sum_{j \in S} \beta_j^2 + \sum_{j \in S^c} \beta_j^2} = \|\beta\|_2.$$

Now in order to get the theoretical $\lambda$ let us note that if we sum two norms,
it is again a norm. Then the dual of this added norm is, because of the
supremum taken over the unit ball, smaller than the dual norm of each one of
the two norms individually. So we can invoke the same theoretical $\lambda$ as with
the square root LASSO

$$\lambda = \left(t/\Delta + 2 + \sqrt{\log(|p|)}\right).$$

And also the theoretical $\eta$ like the group square root LASSO

$$\eta = \left(t/\Delta + 2 + \sqrt{\log(g)}\right).$$

But of course we will not get the same corollary, because the $\Omega$-effective
sparsity will be different.


Structured Sparsity 

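Here we will look at the very general concept of structured sparsity norms.
Let $\mathcal{A} \subset [0, \infty)^p$ be a convex cone such that $\mathcal{A} \cap (0, \infty)^p \ne \emptyset$. Then

$$\Omega(\beta) = \Omega(\beta; \mathcal{A}) := \min_{a \in \mathcal{A}} \frac{1}{2} \sum_{j=1}^{p} \left( \frac{\beta_j^2}{a_j} + a_j \right)$$

is a norm by Micchelli et al. [18]. Some special cases are for example the
$\ell_1$-norm or the wedge or box norm. Define

$$\mathcal{A}_S := \{a_S : a \in \mathcal{A}\}.$$

Then van de Geer [26] also showed that for any $\mathcal{A}_S \subset \mathcal{A}$ we have that the
set S is allowed and we have weak decomposability for the norm $\Omega(\beta)$ with
$\Omega^{S^c}(\beta_{S^c}) := \Omega(\beta_{S^c}; \mathcal{A}_{S^c})$. Hence the estimator

$$\hat\beta_s = \arg\min_{\beta \in \mathbb{R}^p} \left\{ \|Y - X\beta\|_n + \lambda \min_{a \in \mathcal{A}} \frac{1}{2} \sum_{j=1}^{p} \left( \frac{\beta_j^2}{a_j} + a_j \right) \right\}$$

has also the sharp oracle inequality. The dual norm is given by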


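$$\Omega^*(\omega; \mathcal{A}) = \max_{a \in \mathcal{A}(1)} \sqrt{\sum_{j=1}^{p} a_j \omega_j^2}, \quad \omega \in \mathbb{R}^p,$$

$$\Omega^{S^c}_*(\omega; \mathcal{A}_{S^c}) = \max_{a \in \mathcal{A}_{S^c}(1)} \sqrt{\sum_{j=1}^{p-|S|} a_j \omega_j^2}, \quad \omega \in \mathbb{R}^{p-|S|}.$$

Here $\mathcal{A}_{S^c}(1) := \{a \in \mathcal{A}_{S^c} : \|a\|_1 = 1\}$ and $\mathcal{A}(1) := \{a \in \mathcal{A} : \|a\|_1 = 1\}$.
Then once again by Maurer and Pontil [17] we have

$$\max\left( E\left[\frac{\Omega^*\left((\epsilon^T X)_{S_\star}; \mathcal{A}_{S_\star}\right)}{n\sigma}\right],\; E\left[\frac{\Omega^*\left((\epsilon^T X)_{S_\star^c}; \mathcal{A}_{S_\star^c}\right)}{n\sigma}\right] \right) \le \left(2 + \sqrt{\log(|E(\mathcal{A})|)}\right).$$

Here $E(\mathcal{A})$ are the extreme points of the closure of the set $\{a/\|a\|_1 : a \in \mathcal{A}\}$.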


With the definition := max (^vX]r=i \IYa=i ^{^i,Sf,Asi 

That is why A can be taken of the following form 


A — ^ T (^2 + y/lc) 
Since we do not know $S_\star$ we can either upper bound this quantity for a given norm, or
use the fact that $\Omega(\beta) \ge \|\beta\|_1$ and $\Omega^*(\beta) \le \|\beta\|_\infty$ for all $\beta \in \mathbb{R}^p$.
Therefore we can use the same $\lambda$ as for the square root LASSO. We get
corollaries for the structured sparsity norms similar to Corollary 3 for the
square root LASSO.

5 Simulation: Comparison between srLASSO and 
srSLOPE 

The goal of this simulation is to see how the estimation and prediction errors 
for the square root LASSO and the square root SLOPE behave under some 
Gaussian designs. We propose Algorithm 1 to solve the square root SLOPE: 


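Algorithm 1: srSLOPE
input : $\beta_0$ a starting parameter vector,
        $\lambda$ a desired penalty level with a decreasing sequence,
        Y the response vector,
        X the design matrix.
output: $\hat\beta_{srSLOPE} = \arg\min_{\beta \in \mathbb{R}^p} \left( \|Y - X\beta\|_n + \lambda J_\lambda(\beta) \right)$

1 for i ← 0 to $i_{stop}$ do
2     $\sigma_{i+1}$ ← $\|Y - X\beta_i\|_n$;
3     $\beta_{i+1}$ ← $\arg\min_{\beta \in \mathbb{R}^p} \left( \|Y - X\beta\|_n^2 + \sigma_{i+1} \lambda J_\lambda(\beta) \right)$;
4 end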
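
The alternating scheme of Algorithm 1 can be sketched in a few lines. The code below is not the implementation used in the paper: the inner SLOPE problem is solved here by plain proximal gradient descent with a pool-adjacent-violators routine for the proximal operator of the sorted $\ell_1$-norm, instead of the solver of Bogdan et al. It assumes $\|v\|_n^2 = v^T v/n$ and takes a vector w that already contains the products $\lambda \lambda_i$. Under this normalization the stationarity conditions of the square root problem correspond to the inner penalty $2\sigma_{i+1} w$, so the factor 2 in the code is a consequence of that assumed scaling and may differ from the convention of the paper. All function names and default iteration counts are hypothetical.

```python
import numpy as np

def prox_sorted_l1(v, w):
    """argmin_b 0.5*||b - v||_2^2 + sum_i w_i |b|_(i), for w_1 >= ... >= w_p >= 0."""
    order = np.argsort(-np.abs(v))
    z = np.abs(v)[order] - w
    blocks = []                                   # each block stores [sum, count]
    for zi in z:
        blocks.append([zi, 1])
        # pool adjacent blocks until the block averages are strictly decreasing
        while len(blocks) > 1 and (blocks[-2][0] / blocks[-2][1]
                                   <= blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    x = np.concatenate([np.full(c, max(s / c, 0.0)) for s, c in blocks])
    out = np.empty_like(x)
    out[order] = x                                # undo the sorting
    return np.sign(v) * out

def slope(X, Y, w, beta, n_iter=200):
    """Proximal gradient for ||Y - X b||_n^2 + sum_i w_i |b|_(i), started at beta."""
    n = X.shape[0]
    L = 2.0 * np.linalg.norm(X, 2) ** 2 / n       # Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = -2.0 * X.T @ (Y - X @ beta) / n
        beta = prox_sorted_l1(beta - grad / L, w / L)
    return beta

def sqrt_slope_objective(X, Y, w, beta):
    """||Y - X beta||_n + sum_i w_i |beta|_(i)."""
    n = X.shape[0]
    return np.linalg.norm(Y - X @ beta) / np.sqrt(n) + np.dot(w, np.sort(np.abs(beta))[::-1])

def sqrt_slope(X, Y, w, n_outer=10, n_inner=200):
    """Alternate between the noise level (line 2) and a SLOPE fit (line 3)."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_outer):
        sigma = np.linalg.norm(Y - X @ beta) / np.sqrt(n)
        if sigma == 0.0:                          # perfect fit, nothing left to rescale
            break
        beta = slope(X, Y, 2.0 * sigma * w, beta, n_inner)
    return beta
```

Each outer step warm starts the inner solver at the previous iterate, so the square root objective does not increase from one outer iteration to the next. For p = 500 a faster SLOPE solver would be preferable to this plain proximal gradient loop.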


Note that in Algorithm 1 Line 3 we need to solve the usual SLOPE. To solve
the SLOPE we have used the algorithm provided in Bogdan et al. [6]. For
the square root LASSO we have used the R-Package flare by Li et al. [16].
We consider a high-dimensional linear regression model:

$$Y = X\beta^0 + \epsilon,$$

with n = 100 response variables and p = 500 unknown parameters. The
design matrix X is chosen with the rows being fixed i.i.d. realizations from
$\mathcal{N}(0, \Sigma)$. Here the covariance matrix $\Sigma$ has a Toeplitz structure

$$\Sigma_{i,j} = 0.9^{|i-j|}.$$

We choose i.i.d. Gaussian errors $\epsilon$ with a variance of $\sigma^2 = 1$. For the
underlying unknown parameter vector $\beta^0$ we choose different settings. For
each such setting we calculate the square root LASSO and the square root
SLOPE with the theoretical $\lambda$ given in this paper and the $\lambda$ from an 8-fold
cross-validation on the mean squared prediction error. We use r = 100
repetitions to calculate the $\ell_1$-estimation error, the sorted $\ell_1$-estimation error
and the $\ell_2$-prediction error. As for the definition of the sorted $\ell_1$-norm,
we chose a regular decreasing sequence from 1 to 0.1 with length 500. The
results can be found in Tables 1, 2, 3 and 4.
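
The simulation design described here can be generated as in the following sketch, shown for the "Decreasing Case" with active set {1, ..., 7}. It assumes that the "regular decreasing sequence" is equally spaced between 1 and 0.1; the function name, the seed and the use of a Cholesky factor to draw the rows of X are illustrative choices rather than details taken from the paper.

```python
import numpy as np

def simulate(n=100, p=500, rho=0.9, sigma=1.0, seed=0):
    """Draw one data set Y = X beta0 + eps with Toeplitz covariance rho^{|i-j|}."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])       # Sigma_ij = rho^{|i-j|}
    # Rows of X are i.i.d. N(0, Sigma): transform standard normals by the Cholesky factor.
    X = rng.standard_normal((n, p)) @ np.linalg.cholesky(Sigma).T
    beta0 = np.zeros(p)
    beta0[:7] = [4, 3.6, 3.3, 3, 2.6, 2.3, 2]                # decreasing amplitudes on S_0
    Y = X @ beta0 + sigma * rng.standard_normal(n)
    lam_seq = np.linspace(1.0, 0.1, p)                       # assumed equally spaced sequence
    return X, Y, beta0, Sigma, lam_seq

if __name__ == "__main__":
    X, Y, beta0, Sigma, lam_seq = simulate()
    print(X.shape, Y.shape, Sigma[0, :3], lam_seq[:3])
```

The other settings only change the positions and values of the non-zero entries of beta0.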

Decreasing Case:

Here the active set is chosen as $S_0 = \{1, 2, 3, \ldots, 7\}$, and
$\beta^0_{S_0} = (4, 3.6, 3.3, 3, 2.6, 2.3, 2)^T$ is a decreasing sequence.


Table 1: Decreasing $\beta$. Columns: (i) $\|\beta^0 - \hat\beta\|_{\ell_1}$, (ii) $J_\lambda(\beta^0 - \hat\beta)$, (iii) $\|X(\beta^0 - \hat\beta)\|_{\ell_2}$.

           theoretical $\lambda$          cross-validated $\lambda$
           (i)     (ii)    (iii)      (i)     (ii)    (iii)
srSLOPE    2.06    0.21    4.12       2.37    0.26    3.88
srLASSO    1.85    0.19    5.51       1.78    0.19    5.05


Decreasing Random Case:

The active set was randomly chosen to be $S_0 = \{154, 129, 276, 29, 233, 240, 402\}$
and again $\beta^0_{S_0} = (4, 3.6, 3.3, 3, 2.6, 2.3, 2)^T$.


Table 2: Decreasing Random $\beta$. Columns: (i) $\|\beta^0 - \hat\beta\|_{\ell_1}$, (ii) $J_\lambda(\beta^0 - \hat\beta)$, (iii) $\|X(\beta^0 - \hat\beta)\|_{\ell_2}$.

           theoretical $\lambda$          cross-validated $\lambda$
           (i)     (ii)    (iii)      (i)     (ii)    (iii)
srSLOPE    4.50    0.49    7.74       7.87    1.09    7.68
srLASSO    8.48    0.89    29.47      7.81    0.85    9.19


Grouped Case:

Now in order to see if the square root SLOPE can catch grouped variables
better than the square root LASSO we look at an active set
$S_0 = \{1, 2, 3, \ldots, 7\}$ together with $\beta^0_{S_0} = (4, 4, 4, 3, 3, 2, 2)^T$.


Table 3: Grouped $\beta$. Columns: (i) $\|\beta^0 - \hat\beta\|_{\ell_1}$, (ii) $J_\lambda(\beta^0 - \hat\beta)$, (iii) $\|X(\beta^0 - \hat\beta)\|_{\ell_2}$.

           theoretical $\lambda$          cross-validated $\lambda$
           (i)     (ii)    (iii)      (i)     (ii)    (iii)
srSLOPE    2.81    0.29    6.43       1.71    0.18    3.65
srLASSO    3.02    0.31    8.37       1.83    0.19    4.25


Grouped Random Case:

Again we take the same randomly chosen set $S_0 = \{154, 129, 276, 29, 233, 240, 402\}$
with $\beta^0_{S_0} = (4, 4, 4, 3, 3, 2, 2)^T$.


Table 4: Grouped Random $\beta$. Columns: (i) $\|\beta^0 - \hat\beta\|_{\ell_1}$, (ii) $J_\lambda(\beta^0 - \hat\beta)$, (iii) $\|X(\beta^0 - \hat\beta)\|_{\ell_2}$.

           theoretical $\lambda$          cross-validated $\lambda$
           (i)     (ii)    (iii)      (i)     (ii)    (iii)
srSLOPE    6.05    0.66    12.84      5.80    0.66    5.78
srLASSO    16.90   1.77    66.68      6.14    0.67    6.67

The random cases usually lead to larger errors for both estimators. This
is due to the correlation structure of the design matrix. The square root
SLOPE seems to outperform the square root LASSO in the cases where $\beta^0$ is
somewhat grouped (grouped in the sense that amplitudes of the same magnitude
appear). This is due to the structure of the sorted $\ell_1$-norm, which has some
of the sparsity properties of $\ell_1$ as well as some of the grouping properties
of $\ell_\infty$, see Zeng and Figueiredo [29]. Therefore the square root SLOPE
reflects the underlying sparsity structure in the grouped cases. What is also
remarkable is that the square root SLOPE always has a better mean squared
prediction error than the square root LASSO. This is so even in cases where
the square root LASSO has better estimation errors. The estimation errors seem
to be better for the square root LASSO in the decreasing cases.

6 Discussion 

Sparsity inducing norms different from $\ell_1$ may be used to facilitate the
interpretation of the results. Depending on the sparsity structure we have
provided sharp oracle inequalities for square root regularization. Due to
the square root regularization we do not need to estimate the variance; the
estimators are all pivotal. Moreover, because the penalty is a norm, the
optimization problems are all convex, which is a practical advantage when
implementing the estimation procedures. For these sharp oracle inequalities
we only needed the weak decomposability and not the decomposability
property of the $\ell_1$-norm. The weak decomposability generalizes the desired
property of promoting an estimated parameter vector with a sparse structure.
The structure of the $\Omega$- and $\Omega^{S^c}$-norms influences the oracle bound.
Therefore it is useful to use norms that reflect the true underlying sparsity
structure.